ggplot2 and choosing the right visualization for your audience

“Advanced Graphics and Data Visualization in R” is brought to you by the Centre for the Analysis of Genome Evolution & Function’s (CAGEF) bioinformatics training initiative. CSB1021 was developed to enhance the skills of students with basic backgrounds in R by focusing on available philosophies, methods, and packages for plotting scientific data. Many of the datasets and examples used in this course are drawn from real-world sources, and the techniques learned here aim to be broadly applicable across fields.
This lesson is the second in a 6-part series. The aim for the end of this series is for students to recognize how to import, format, and display data based on their intended message and audience. The format and style of these visualizations will help to identify and convey the key message(s) from their experimental data.
The structure of the class is a code-along style in R markdown notebooks. At the start of each lecture, skeleton versions of the lecture will be provided for use on the University of Toronto datatools Hub so students can program along with the instructor.
This week will delve into some helpful visualizations available through the ggplot2 package. First we’ll go over deciding which visualizations are best for what we want to accomplish, and then explore some of these in more detail.
At the end of this lecture you will have covered the following topics
grey background - a package, function, code, command or directory. Backticks are also used for in-line code.
italics - an important term or concept, or an individual file or folder
bold - a heading or a term that is being defined
blue text - a named or unnamed hyperlink
... - Within each coding cell this will indicate an area of code that students will need to complete for the code cell to run correctly.
Blue boxes: A key concept that is being introduced
Yellow boxes: A risk or caution
Green boxes: Recommended reads and resources to learn R
Red boxes: A comprehension question which may or may not involve a coding cell. You will usually find these at the end of a section.
Each week, new lesson files will appear within your RStudio folders.
We are pulling from a GitHub repository using a git-pull link. Simply click on the link and it will take you to the University of Toronto datatools Hub. You will need to use your UTORid credentials to complete the login process. From there you will find each week’s lecture files in the
directory /2025-03-Adv_Graphics_R/Lecture_XX. You will find
a partially coded skeleton.Rmd file as well as all of the
data files necessary to run the week’s lecture.
Alternatively, you can download the R-Markdown Notebook (.Rmd) and data files from the RStudio server to your personal computer if you would like to run them independently of the University of Toronto datatools Hub.
A live lecture version will be available at camok.github.io that will update as the lecture progresses. Be sure to refresh to take a look if you get lost!
At the end of each lecture there will be a completed version of the lecture code released as an HTML file under the Modules section of Quercus.
Today’s datasets will continue to focus on the Ontario public sector salary disclosure, also known as the “Sunshine list”. This list, started in 1996, publishes all public sector employees with an annual salary at or above $100,000. Although not strictly biological data, this is a great dataset to work with because it contains many observations across a long time period, with enough data to generate subgroups based on sector, employer, and role!
You can find more information about this dataset on the Ontario public sector salary disclosure webpage.
This dataframe can be found in the Lecture02.RData file and contains a full version of our dataset from last week. While we were unable to practice our wrangling on the data from all years, we can load a full version that has been saved in a more compact binary file!
This is a version of the Sunshine list ranging from 1996-2023. It has been altered by replacing all names with random numeric identifiers and contains nearly 2.5M observations.
tidyverse, which contains a number of packages including dplyr, tidyr, stringr, forcats, and ggplot2
viridis, which helps to create color-blind-friendly palettes for our data visualizations
ggbeeswarm and ggridges, which are new packages we need to install. They’ll help us generate some nice graphs.
# ---------- Install packages here ---------- #
# install.packages("ggbeeswarm", dependencies = TRUE)
# install.packages("ggridges", dependencies = TRUE)
# ---------- Source packages here ---------- #
# Packages to help tidy our data
library(tidyverse)
# Packages for the graphical analysis section
library(viridis)
# New visualisation packages
library(ggridges) # ridgeline plots
library(ggbeeswarm)
How do we extract meaningful insights from our data? If you have previously explored Anscombe’s quartet you’ll know that, as scientists and lay people, we can sometimes be obsessed with summary statistics - mean, median, mode, standard deviation. While these values are a helpful way to quickly assess a population, they can be flawed to the point of deception. Instead, we should temper our investigations by visualizing our data. A deeper understanding of your data trends and potential models comes from dissecting attributes of your data, which jump out more easily through visualization.
Equally important is the ability to convey our findings to others. The right visualizations, whether simplistic or complex, should effectively communicate our key message.
One approach to effective data visualization relies on the Grammar of Graphics framework originally proposed by Leland Wilkinson (2005). The idea of grammar can be summarized as follows:
The grammar of graphics facilitates the concise description of the components of any graphic. Hadley Wickham of tidyverse fame has proposed a variant on this concept - the layered grammar of graphics framework. By following a layered approach of defined components, it becomes easy to build a visualization.
The Major Components of the Grammar of Graphics by Dipanjan Sarkar
We can break down the above pyramid by the base components, building from the bottom upwards.
1. Data: your visualization always starts here. What are the dimensions you want to visualize? What aspect of your data are you trying to convey?
2. Aesthetics: assign your axes based on the data dimensions you have chosen. Where will the majority of the data fall on your plot? Are there other dimensions (such as categorically encoded groupings) that can be conveyed by aspects like size, shape, colour, or fill?
3. Scale: do you need to scale/transform any values to fit your data within a range? This includes layers that map between the data and the aesthetics.
4. Geometric objects: how will you display your data
within your visualization. Which geom_* will you use?
5. Statistics: are there additional summary statistics that should be included in the visualization? Some examples include central tendency, spread, confidence intervals, standard error, etc.
6. Facets: will generating subplots of the data add a dimension to our visualization that would otherwise be lost?
7. Coordinate system: will your visualization follow a classic cartesian, semi-log, polar, etc. coordinate system?
sunshineFinal.df covers 28 years of data

Let’s look at a full version of our data from last lecture. With this full dataset we can better appreciate some of the figures and visualizations we created last week, but also improve upon them based on today’s lecture material.
We’ll begin by loading the data into our session from an
.RData file.
# Read in sunshine data
load("./data/Lecture02.RData")
# Check the structure and preview the data
str(sunshineFinal.df)
## tibble [2,447,012 x 7] (S3: tbl_df/tbl/data.frame)
## $ numericID : Factor w/ 516966 levels "10000005","10000011",..: 475892 32433 382678 443816 443816 443816 443816 443816 394094 394094 ...
## $ salary : num [1:2447012] 194890 115604 149434 109383 173225 ...
## $ taxableBenefits: num [1:2447012] 711 403 513 4922 4939 ...
## $ year : int [1:2447012] 1996 1996 1996 1996 1997 1998 1999 2000 1996 1997 ...
## $ sector : Factor w/ 37 levels "Colleges","Crown Agencies",..: 35 35 35 34 34 34 34 34 3 3 ...
## $ employer : chr [1:2447012] "Addiction Research Foundation" "Addiction Research Foundation" "Addiction Research Foundation" "Agriculture,Food And Rural Affairs" ...
## $ title : chr [1:2447012] "President & Ceo" "Dir., Soc. Eval. Research & Act. Dir., Clin. Research" "V.p., Research & Coordinator, Intern. Programs" "Deputy Minister" ...
head(sunshineFinal.df)
Looking at the data itself, we have seven variables per observation: an anonymized numericID, salary, taxableBenefits, year, sector, employer, and title.
How many unique individuals are there? How many unique job titles are listed in our dataset?
# Calculate unique individuals
sunshineFinal.df %>%
pull(numericID) %>%
unique() %>%
length()
## [1] 516966
# Calculate unique job titles
sunshineFinal.df %>%
pull(title) %>%
unique() %>%
length()
## [1] 211182
There look to be about 517K individuals and 211K unique job titles in our dataset of ~2.45M entries. When we think about plotting our individual data points, that’s a lot of data!
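As an aside, dplyr also offers n_distinct(), which collapses the pull() %>% unique() %>% length() chain into a single call. A minimal sketch on a toy tibble (the values here are invented purely for illustration):

```r
library(dplyr)

# A tiny stand-in for sunshineFinal.df (illustrative values only)
toy.df <- tibble(numericID = c("A", "A", "B", "C", "C"),
                 title     = c("Analyst", "Manager", "Analyst", "Clerk", "Clerk"))

# n_distinct() is equivalent to length(unique(x))
counts <- toy.df %>%
  summarise(uniqueIDs    = n_distinct(numericID),
            uniqueTitles = n_distinct(title))
counts
# → uniqueIDs = 3, uniqueTitles = 3
```

With the real data this would be sunshineFinal.df %>% summarise(n_distinct(numericID), n_distinct(title)).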
Before we continue on our journey, a quick look under the hood shows
that when plotting datapoints, ggplot will plot each point
individually! That means if we were to plot all of our observations in
the sunshine list, we would be plotting some 2.5M datapoints which can
take some time to render. Instead, we will summarize our data by
grouping it by individuals. From this, we can gather a few interesting
points as well such as how long an individual has been on the list!
# 1.2.1 Generate a summary dataset for individuals across years
# Save our summary to a new variable
sunshineSummary.df <-
# Pass along the dataset
sunshineFinal.df %>%
# Group based on individual (numericID)
group_by(numericID) %>%
# Summarize by capturing ranges and means of salary and taxable benefits
summarize(minSalary = min(salary),
maxSalary = max(salary),
meanSalary = mean(salary),
minTaxableBenefits = min(taxableBenefits),
maxTaxableBenefits = max(taxableBenefits),
meanTaxableBenefits = mean(taxableBenefits),
numYears = n(), # How many years have they been on the list?
)
# Look at the summary we've created
str(sunshineSummary.df)
## tibble [516,966 x 8] (S3: tbl_df/tbl/data.frame)
## $ numericID : Factor w/ 516966 levels "10000005","10000011",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ minSalary : num [1:516966] 103334 101221 100971 100671 111280 ...
## $ maxSalary : num [1:516966] 122621 112936 100971 166600 168393 ...
## $ meanSalary : num [1:516966] 112261 107272 100971 127163 136591 ...
## $ minTaxableBenefits : num [1:516966] 181 433 708 0 679 ...
## $ maxTaxableBenefits : num [1:516966] 216 724 708 1030 1135 ...
## $ meanTaxableBenefits: num [1:516966] 196 586 708 543 966 ...
## $ numYears : int [1:516966] 6 5 1 6 12 13 4 6 1 2 ...
Okay, it appears that we’ve decreased our overall number of entries by ~80%! That should simplify our plotting needs. We’re now ready to start visualizing our data, but how will we go about doing it?
Let’s take a look at the following chart from Dr. Andrew V. Abela.
From the flowchart above we’ll mainly explore looking at relationships and composition using our dataset as a basis for our visualizations.
We’ll cover distributions and comparison with a more complex dataset later in this lecture.
Depending on the nature of your data and its dimensions there are a number of plots to choose from to represent and convey relationships between your variables. These various plots can reveal trends, correlations, and ultimately relationships between variables.
In our first lecture, we briefly explored bar charts and a vertical version of scatterplots called the strip plot. As an initial form of graphical assessment today we’ll begin with scatterplots which build a framework for exploring our data further.
| Relationship Description | Types of charts |
|---|---|
| Finding correlations in your data | Scatterplot (biplot) |
| | Strip plot |
| | Bubble chart |
| | Lineplot |
| | Heatmap |
| Showing connections between groups | Arc diagram |
| | Chord diagram |
| | Connection map |
| | Network diagram |
| | Non-ribbon chord diagram |
| | Tree diagram |
| Showing relationships and correlations between two or more variables | Heatmap |
| | Marimekko chart |
| | Parallel coordinates plot |
| | Radar chart |
| | Venn diagram |
We’ll focus on just two relational plots in this section: the scatterplot and its variant, the bubble chart. Parallel coordinate plots will also make an appearance later in this lecture.
geom_point() graphs

When we try to examine relationships, one direction we can take is to look for trends by comparing one variable against another. From an experimental standpoint we can graph our independent variable on the x-axis, looking for changes in our dependent variable/measurement on the y-axis. This is most easily accomplished when both of your variables are on a continuous scale.
When trying to show correlations in your data between two to three variables you can use scatterplots (also known as biplots). Remember there are some limitations when visualizing multi-dimensional data on a two-dimensional canvas. Overall these plots can convey to your audience the potential correlations between your variables or separations between groups based on their colour or shape.
Using the alpha parameter to help visualize overlapping data points

Let’s begin with a scatterplot comparing mean salaries versus mean taxable benefits. We’ll use each individual observation as a datapoint to compare the two values to see if, perhaps, there is a relationship in how one is calculated versus the other. Of note, within the geom_point() layer, we’ll be updating the alpha parameter to change the transparency of our points.
Using the “alpha” parameter: Note that alpha = 1 will make completely opaque points while alpha = 0 will make completely transparent points. As points begin to overlap, they will create increasingly opaque areas in your visualization. Generally an opacity value of about 0.5-0.7 is a good place to start when working with a moderate number of overlapping points.
# 2.1.1 scatterplot of taxable benefits vs salary
sunshineSummary.df %>%
ggplot() +
# 2. Aesthetics
aes(x = meanSalary, y = meanTaxableBenefits) +
# set text size
theme(text = element_text(size = 10)) +
# 4. Geoms
geom_point(alpha = 0.5)
Hard to tell but it looks like perhaps as salary rises, the taxable benefit tends to decrease. Of course, we do have a major outlier in our data but that may be an anomaly, or some kind of stock split/dividend in this case.
Challenge yourself: how would you identify which of our observations this taxable benefits outlier belongs to?
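One hedged way to approach this challenge is dplyr’s slice_max(), which pulls the row(s) with the largest value of a variable. Sketched here on a toy summary table (the values are invented) so it runs standalone:

```r
library(dplyr)

# Toy stand-in for sunshineSummary.df (illustrative values only)
toySummary.df <- tibble(numericID = c("10000005", "10000011", "10000042"),
                        meanSalary = c(112261, 107272, 100971),
                        meanTaxableBenefits = c(196, 586, 740000))

# Grab the observation with the single largest mean taxable benefit
topBenefit <- toySummary.df %>% slice_max(meanTaxableBenefits, n = 1)
topBenefit

# With the real data, the same idea would be:
# sunshineSummary.df %>% slice_max(meanTaxableBenefits, n = 1)
```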
The bubbleplot, as we’ll see, brings an additional dimension to your visualization by varying the size of your points based on a (usually) continuous variable. This can help to highlight underlying data trends in a more obvious fashion for your audience.
To achieve this we will set the size parameter in our
aesthetics aes() layer. By setting the size
parameter to a variable in our data, it will alter the size of our data
points. We will augment the results of this by providing a size range
through the scale_size() layer.
# 2.1.2 scatterplot of taxable benefits vs salary
sunshineSummary.df %>%
ggplot() +
# 2. Aesthetics
aes(x = meanSalary, y = meanTaxableBenefits,
size = numYears) +
# set text size
theme(text = element_text(size = 10)) +
# 3. Scaling
### 2.1.2 set the range for sizing our dots
scale_size(range = c(1, 5), name = "Years listed") +
# 4. Geoms
geom_point(alpha = 0.5)
While our bubbleplot does add some dimensionality to our visualization, we can see a lot of values bunched up together at the lower end of our scales, and likewise a few values that sit far out on our x/y axes. To even this out, we can transform our axes to display values on a log10 scale. We have two places where we can accomplish this:

1. In the aesthetics, by transforming the variable itself, e.g. log(meanSalary). Data will be placed on a log value scale but may have less meaning to the lay person.
2. In the scaling layers, with scale_x_log10() and scale_y_log10(). The underlying data are untouched and the axis labels keep their original, human-readable values.

# 2.2.0 scatterplot of taxable benefits vs salary
sunshineSummary.df %>%
ggplot() +
# 2. Aesthetics
aes(x = meanSalary, y = meanTaxableBenefits,
size = numYears) +
# set text size
theme(text = element_text(size = 10)) +
# 3. Scaling
## set the range for sizing our dots
scale_size(range = c(1, 5), name = "Years listed") +
### 2.2.0 Set the x and y-axis to log scales
scale_x_log10() +
scale_y_log10() +
# 4. Geoms
geom_point(alpha = 0.2)
## Warning in scale_y_log10(): log-10 transformation introduced infinite values.
From our updated bubbleplot it now becomes clear that there is an enormous range for taxable benefits regardless of salary. However, there appears to be a soft upper limit of about 1x10^5, even as we approach the higher ranges of salaries. We also see another small problem that comes along with log transformation of our data.
Look closely at our y-axis range and you’ll see values that are below 1x10^0. While there may be values less than 1 in our dataset, we see many datapoints at or below 1x10^-3, which is 0.001! Less than a penny! The issue here is that many of our datapoints have a meanTaxableBenefits value of 0. In R, if we attempt to take log10(0) we get -Inf because the value can’t exist but rather approaches negative infinity. This creates a value range that doesn’t necessarily exist in our dataset.
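You can see this behaviour directly at the console:

```r
# log10() of zero is negative infinity, not a missing value
log10(0)
# → -Inf

# A log scale cannot place -Inf on a finite axis,
# which is what triggers the "infinite values" warning
is.infinite(log10(0))
# → TRUE
```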
Instead, we can make a quick replacement in our data before plotting by replacing any value less than 1 with a minimum value of 1! In this case, it won’t drastically affect our trends but it will drastically improve how our plot looks. We could be more careful and only update values of exactly 0, but this will suffice.
Note that we’ll make quick use of the if_else() function
in our mutate() call.
# 2.2.1 scatterplot of taxable benefits vs salary, log-transformed without 0 values
sunshineSummary.df %>%
### 2.2.1 Mutate taxable benefit data to avoid values of zero
mutate(meanTaxableBenefits = if_else(meanTaxableBenefits < 1,
# replace mean taxable benefit values below 1 with 1
true = 1,
false = meanTaxableBenefits)) %>%
# 1. Data
ggplot() +
# 2. Aesthetics
aes(x = meanSalary, y = meanTaxableBenefits,
size = numYears) +
# set text size
theme(text = element_text(size = 10)) +
# 3. Scaling
# set the range for sizing our dots
scale_size(range = c(1, 5), name = "Years listed") +
## Set the x and y-axis to log scales
scale_x_log10() +
scale_y_log10() +
# 4. Geoms
geom_point(alpha = 0.2)
While we cannot conclude that more years working equates to higher benefits in this graph, we can see that generally higher salaries will have higher taxable benefits.
How do you interpret log-transformed data?: After log-transforming both axes it looks like the relationship between salary and taxable benefits is leaning towards linear. While the transformation has definitely improved the visualization, it has made the interpretation a little harder. Overall, in this situation, you’ll either want to model against the untransformed data or do the math correctly and recognize that the relationship is a power function:
\[\begin{equation*} \begin{aligned} \text{log} y &= B + a \text{log} x &\text{where B is the intercept of the vertical axis and a is the slope}\\[10pt] y &= 10^{(B + a \text{log} x)}\\[10pt] y &= 10^{B} 10^{a \text{log} x}\\[10pt] y &= 10^{B} x^{a} &\text{Recall that } 10^{a \text{log} x} = (10^{\text{log} x})^{a} = x^{a}\\[10pt] \end{aligned} \end{equation*}\]
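We can sanity-check this algebra numerically in R (the intercept B, slope a, and x below are arbitrary illustrative values):

```r
# Arbitrary illustrative intercept and slope on the log-log scale
B <- 1.5
a <- 0.8
x <- 42

# Both forms of y agree: 10^(B + a*log10(x)) equals 10^B * x^a
y1 <- 10^(B + a * log10(x))
y2 <- 10^B * x^a
all.equal(y1, y2)
# → TRUE
```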
While we can gather some information here, there are still perhaps too many data points for a proper interpretation. Some approaches we could take include random subsetting, subsetting from a block or binning our data through filtering.
slice() your data

One approach to quickly subsetting your data is with the slice() function, which can take a start:end argument to select rows from start to end, inclusive. You can also use it to quickly drop specific rows from your dataset. In fact, there are many ways you can use slice(), and also many variants of it in the dplyr package.
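A quick tour of a few of those uses and variants on a toy tibble (values invented for illustration):

```r
library(dplyr)

# Toy data standing in for our summary table
toy.df <- tibble(id = 1:6, salary = c(120, 95, 180, 150, 110, 200))

toy.df %>% slice(2:4)               # rows 2 through 4, inclusive
toy.df %>% slice(-1)                # drop row 1
toy.df %>% slice_head(n = 2)        # first two rows
toy.df %>% slice_max(salary, n = 2) # the two largest salaries
```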
Let’s rerun our plot code but randomly
slice 1000 datapoints instead using the
slice_sample() function. Note that we must explicitly
assign the number of samples to slice with the parameter
n.
# 2.3.1 Slice and use 1000 random datapoints for our figure
# Set a seed for random sampling reproducibility
set.seed(2025)
sunshineSummary.df %>%
# Mutate taxable benefit data to avoid values of zero
mutate(meanTaxableBenefits = if_else(meanTaxableBenefits < 1,
# replace mean taxable benefit values below 1 with 1
true = 1,
false = meanTaxableBenefits)) %>%
### 2.3.1 Randomly slice 1000 samples
slice_sample(n=1000) %>%
ggplot() +
# 2. Aesthetics
aes(x = meanSalary, y = meanTaxableBenefits,
size = numYears) +
# set text size
theme(text = element_text(size = 10)) +
# 3. Scaling
## set the range for sizing our dots
scale_size(range = c(1, 5), name = "Years listed") +
## Set the x and y-axis to log scales
scale_x_log10() +
scale_y_log10() +
# 4. Geoms
geom_point(alpha = 0.2)
Looks like we still see a slightly rising pattern of taxable benefits with increasing salary, at least when we get into the very high ranges!
Whereas with slice_sample() we were pretty indiscriminately pulling observations from our data, what if we wanted to compare a specific data range? For example, we could restrict salary values to a smaller range like 125-150K and see how widely taxable benefits can vary. To accomplish this, we can combine predicates within a filter() call.
# 2.3.2 Filter your data into a binned group
sunshineSummary.df %>%
# Mutate taxable benefit data to avoid values of zero
mutate(meanTaxableBenefits = if_else(meanTaxableBenefits < 1,
# replace mean taxable benefit values below 1 with 1
true = 1,
false = meanTaxableBenefits)) %>%
### 2.3.2 Filter for salaries between 125K and 150K
filter(meanSalary >= 125000, meanSalary <= 150000) %>%
ggplot() +
# 2. Aesthetics
aes(x = meanSalary, y = meanTaxableBenefits,
size = numYears) +
# set text size
theme(text = element_text(size = 10)) +
# 3. Scaling
## set the range for sizing our dots
scale_size(range = c(1, 5), name = "Years listed") +
## Set the x and y-axis to log scales
scale_x_log10() +
scale_y_log10() +
# 4. Geoms
geom_point(alpha = 0.2)
Looking at a narrow range of all salaries between $125,000 and $150,000, it appears that the overall range of taxable benefits remains the same. One could argue, however, that the overall distribution of taxable benefits does appear to shift slightly with increasing salary. The datapoints, however, are far too dense to determine this, and eyeballing isn’t the most accurate way of determining the distribution of our data.
Adding a geom_smooth() layer

Depending on the density of your datapoints, one common method of visualizing a trend in your data is to add a line of best fit. While this often requires further interpretation on its own, it is a quick way to visually assess if there is any relationship between two variables on your biplot.
We can easily add a fitted line (aka a model) to our plot using the geom_smooth() layer. Based on your expectations, you can set the method parameter to a specific function to generate your model:

1. lm: a linear model, drawing a simple y = ax + b line
2. loess: local polynomial regression fitting, which determines the slope of the line at each point \(x\) based on local/surrounding datapoints!

You can read more about your method options at the tidyverse website.
For now, let’s add a linear model to our plot and see if that gives
us any more perspective. At the same time, we’ll also move one of our
aes() settings from the aes() layer to
directly in the geom_point() layer. We’ll talk more about
why in Lecture 03!
# 2.3.3 Add a line of best fit
sunshineSummary.df %>%
# Mutate taxable benefit data to avoid values of zero
mutate(meanTaxableBenefits = if_else(meanTaxableBenefits < 1,
# replace mean taxable benefit values below 1 with 1
true = 1,
false = meanTaxableBenefits)) %>%
# Filter for salaries between 125K and 150K
filter(meanSalary >= 125000, meanSalary <= 150000) %>%
ggplot() +
# 2. Aesthetics
aes(x = meanSalary, y = meanTaxableBenefits) +
# set text size
theme(text = element_text(size = 10)) +
# 3. Scaling
# set the range for sizing our dots
scale_size(range = c(1, 5), name = "Years listed") +
# Set the x and y-axis to log scales
scale_x_log10() +
scale_y_log10() +
# 4. Geoms
geom_point(alpha = 0.2, aes(size = numYears)) +
### 2.3.3 Add a line of best fit
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
Definitely a weak upward trend in our slope. From here, you could investigate and generate the actual linear model to check how significant this relationship is! It would also be great to take a closer look at the distribution of our data on a visual level, but how do we accomplish that?
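As a sketch of that follow-up, we can fit the same model that geom_smooth(method = "lm") draws and inspect its slope and p-value. Toy data (generated here, not the Sunshine list) stand in for our filtered summary:

```r
# Toy log-log data with a weak built-in upward trend (illustrative only)
set.seed(2025)
x <- runif(200, 125000, 150000)
y <- 10^(0.5 + 0.3 * log10(x) + rnorm(200, sd = 0.4))

# Fit the linear model on the log10 scale, matching the plotted axes
fit <- lm(log10(y) ~ log10(x))

# The slope estimate, its standard error, and significance
summary(fit)$coefficients
```

With the real data, you would pipe the filtered sunshineSummary.df into lm() the same way and check whether the slope differs significantly from zero.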
Many students are familiar with the classic bar chart. Thus far, we’ve already used it to some extent to look at our last lecture’s sunshine data. When comparing groups or populations for differences in their distribution, however, the bar chart falls quite short of the ideal.
When comparing distributions, we again are often concerned with summary statistics - mean, median, standard deviation, confidence intervals. The nature of our data, however, is often not continuous across an entire y-axis interval as a bar chart may suggest but rather our population is the result of values centred, with some variance, around a mean value.
Over the remainder of the lecture we will compare and contrast 4 types of distribution plots for their relevance and when they may be most useful to visualize our data. When applicable to our data visualizations, we will also display all of our data points to better illustrate our distribution versus the visualization. We will examine:
Using geom_bar() or geom_col() to display data

As we have already seen, the geom_bar() layer can be used to display our data in multiple ways. The height of the bar is proportional to either:
1. The number of observations of each group or
2. The sum of weights applied by a weight aesthetic
In our Lecture 01 examples we actually used the
value of our observations (which was a summary data table) to determine
height and proportions of our groups by setting the parameter
stat="identity". This was the application of a weighting
based on the values of the y-axis variable we chose.
A simpler way to accomplish this effect is with
geom_col(), which already uses stat_identity()
by default to calculate proportions.
| geom | Description | Requirements |
|---|---|---|
| geom_bar() | Counts observations in your data and (by default) determines bar height from those counts | Accepts only an x OR y parameter in aes() |
| geom_col() | Uses y-axis values as the height of each bar | Requires both x AND y parameters in aes() |
Both of these geoms use position = "stack" by default, with bar heights matching the counts or sums for multiple values sharing the same x position. Such instances can instead be displayed side by side using position = "dodge" or position = "dodge2". This is only helpful, however, when the number of x values (i.e. categories) is low; otherwise the graph becomes crowded.
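A minimal sketch of the difference between the two positions (toy counts, invented for illustration):

```r
library(ggplot2)

# Two sectors observed across two years (illustrative values only)
toy.df <- data.frame(sector = rep(c("Colleges", "Hospitals"), each = 2),
                     year   = rep(c("2022", "2023"), times = 2),
                     total  = c(10, 12, 20, 25))

# Stacked (the default): one bar per sector, with year segments stacked
p1 <- ggplot(toy.df, aes(x = sector, y = total, fill = year)) +
  geom_col()  # position = "stack" is the default

# Dodged: the years drawn side by side within each sector
p2 <- ggplot(toy.df, aes(x = sector, y = total, fill = year)) +
  geom_col(position = "dodge")
```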
Let’s begin with something simpler. We’ll go back to our original
dataset, sunshineFinal.df and filter for just a subset of
our sectors (non-“Ministry/seconded” data) as well as narrow down our
scope to just salaries paid in 2023.
We’ll try to convey the total salary amount paid to each sector for
that year as a comparison to see which is “largest” using only the
geom_col() layer.
# 3.1.0 build a simple barchart of total salary per sector in 2023
sunshineFinal.df %>%
# Filter out the "Ministry" datapoints
filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE),
year %in% c(2023)) %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x = sector, y = salary) +
theme(text = element_text(size = 20)) + # set text size
theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
# 4. Geoms
### 3.1.0 Add our bars
geom_col()
Using reorder() to sort your data as you plot it

In the case of sunshineFinal.df we can do a little better by ordering our x-axis of sector by total salary. Since our data table is quite simple, there are a number of ways to accomplish a sort:
1. Convert sector to a factor and use fct_reorder() to sort our factors.
2. Use the reorder() function while generating the plot.

Let’s sort our sector by total salary in descending order using our second choice, reorder(). This takes a few parameters as follows:
x: an atomic vector (usually a factor but not
required) that you want to reorder
X: a vector of the same length as x
whose element-matched values will be used to determine the order of
x
FUN: the function you want to apply to
X to reorder the values of x
Note that the output of this function will be a vector of type
factor for your values x.
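A small standalone example of reorder() outside of a plot (toy sectors and salaries, invented for illustration):

```r
# Toy data (illustrative values only)
sector <- c("Colleges", "Hospitals", "Colleges", "Hospitals", "Universities")
salary <- c(100, 300, 120, 250, 500)

# Reorder sector levels by sum of salary; the "-" flips the sort
# so the largest total comes first
sorted <- reorder(x = sector, X = -salary, FUN = sum)

class(sorted)   # a factor, as noted above
levels(sorted)
# → "Hospitals" "Universities" "Colleges" (totals 550, 500, 220)
```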
# 3.1.1 reorder your barplot on the fly
sunshineFinal.df %>%
# Filter out the "Ministry" datapoints
filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE),
year %in% c(2023)) %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
### 3.1.1 Use reorder() to alter how our data is displayed on the x-axis
aes(x = reorder(x = sector,
X = -salary, # Note the "-" here essentially reverses our salary values!
FUN = sum),
y = salary, fill = sector) +
theme(text = element_text(size = 20)) + # set text size
theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
# 4. Geoms
geom_col() ## Add our bars
From the reordered barplot above we can quickly gauge the difference between sectors as they are sorted by total salary payout. There’s not much need to colour by sector as well, but it does add some emphasis to “School Boards” and further reinforces the idea that we have sorted by total payout.
What are we missing from this visualization that would help the reader? It would be nice to know how many individuals make up each sector in that year. Knowing that size would tell us a bit more about the nature of the data we are seeing. Thus we can answer: is “School Boards” simply a much larger pool of individuals, hence its higher total salary?
Using the fill parameter to show quantitative data

One simple way to enhance our barplot is through the use of fill colour. By setting the fill parameter based on the number of observations in each sector, we can shade each bar for a visual assessment of total individuals. The simplest way to accomplish this is to summarise() our data by sector before passing it along for visualization.
# 3.2.1 shade bars by a continuous value
sunshineFinal.df %>%
# Filter out the "Ministry" datapoints
filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE),
year %in% c(2023)) %>%
# Move our reordering of sectors into the data wrangling
mutate(sector = reorder(x = sector, X = -salary, FUN = sum)) %>%
# Group based on sector
group_by(sector) %>%
# Summarize to get total salary and population size
summarise(totalSalary = sum(salary),
sectorSize = n()) %>%
# 1. Data
ggplot() +
# 2. Aesthetics
aes(x = sector,
y = totalSalary,
fill = sectorSize) + ### 3.2.1 Set the fill colour to a variable!
theme(text = element_text(size = 20)) + # set text size
theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
# 4. Geoms
geom_col()
Layering geom_* layers to more accurately reflect values

As you can see from above, the shading does help us determine that our largest overall payout goes to the largest group. However, unless you have a very good eye for shades, we can’t see the full detail of our sector sizes. We can still see that “School Boards” has a much higher number of employees, but how similar in size are “Universities” through to “Legislative Assembly And Offices” on our figure?
Instead of fill colour, we can add a little detail by layering an
additional geom. In this case, we will use geom_point() to
add the population of each sector after scaling it
1:100,000.
Why scale sector size by 100,000? This is a decision based on what we already know about the data. Our y-axis totalSalary values range from 0 to roughly 8×10^9. Knowing the populations of our sectors range from 0 to about 80,000, it makes sense to bring those values onto the same scale.
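The arithmetic behind that choice is quick to check: the largest sector size, multiplied by our scaling factor, lands right at the top of the salary axis.

```r
# Largest sector size and scaling factor (values from the text)
sectorSize <- 80000
scaled     <- sectorSize * 100000
scaled  # 8e9, matching the top of the totalSalary axis
```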
# 3.2.2 add points to represent sector size
sunshineFinal.df %>%
# Filter out the "Ministry" datapoints
filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE),
year %in% c(2023)) %>%
# Move our reordering of sectors into the data wrangling
mutate(sector = reorder(x = sector, X = -salary, FUN = sum)) %>%
# Group based on sector
group_by(sector) %>%
# Summarize to get total salary and population size
summarise(totalSalary = sum(salary),
sectorSize = n()) %>%
# 1. Data
ggplot() +
# 2. Aesthetics
aes(x = sector,
y = totalSalary) +
theme(text = element_text(size = 20)) + # set text size
theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
guides(colour = guide_legend(title="Sector\nsize *10^5")) +
# 4. Geoms
geom_col() +
### 3.2.2 Add points to represent sector size
geom_point(aes(y=sectorSize*100000, colour="red"), size = 5)
Now we have more clarity on population sizes between sectors, answering our question about the differences in sector size. We can see, for instance, that the “Colleges” sector has more individuals than “Ontario Power Generation”, suggesting that the mean salary per individual is higher in the latter group! To publish this figure we would want to fix a few additional things, like the legend presentation, the axis names and titles, and perhaps spruce up the colours. We’ll drill into these ideas more during our next lecture!
On this same topic, let’s visit one last plot that can use some pieces from the two variants of barplots we’ve used. The lollipop graph clarifies that x-axis values are more singular in nature rather than spanning a range while still visually connecting those y-axis values to their x-axis categories.
It looks very much like it sounds, and we’ll add an extra twist to ours by setting the point size to the sector size, giving it one more informational dimension. To accomplish this visualization we’ll combine a geom_point() with a geom_segment().
Note that the geom_segment() layer requires pairs of
both the x and y values as arguments (x, xend,
y, and yend).
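Those four required aesthetics can be pictured as a small table of endpoints. Below is a minimal sketch with made-up sector names and values, showing that each “stick” runs vertically from y = 0 up to its bar’s value:

```r
# Hypothetical endpoints for two lollipop sticks (not the real data):
# a vertical stick keeps xend identical to x
sticks <- data.frame(
  x    = c("Sector A", "Sector B"),
  xend = c("Sector A", "Sector B"),  # vertical: xend == x
  y    = 0,                          # every stick starts at zero
  yend = c(5e9, 3e9)                 # the totalSalary values
)
```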
Aesthetics when working with multiple geoms: As you’ll see in the following code, we’ve made some adjustments due to working with multiple geom_* layers. Since there can be overlapping aesthetics between geoms, you need to be cognizant of their effects. Rather than set these parameters in the aes() layer, set them directly in their respective geoms.
# 3.2.3 Generate a lollipop chart
sunshineFinal.df %>%
# Filter out the "Ministry" datapoints
filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE),
year %in% c(2023)) %>%
# Move our reordering of sectors into the data wrangling
mutate(sector = reorder(x = sector, X = -salary, FUN = sum)) %>%
# Group based on sector
group_by(sector) %>%
# Summarize to get total salary and population size
summarise(totalSalary = sum(salary),
sectorSize = n()) %>%
# 1. Data
ggplot() +
# 2. Aesthetics
aes(x = sector,
y = totalSalary) +
theme(text = element_text(size = 20)) + # set text size
theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
# Use this layer to set the size range of your datapoints
scale_size(range = c(1, 10), name = "Sector\nsize") +
# 4. Geoms
### 3.2.3 Make the stick of the lollipop
geom_segment(aes(x=sector, xend=sector,
y=0, yend=totalSalary)) +
### 3.2.3 Add points to represent sector size
geom_point(aes(size = sectorSize), colour = "orchid")
Let’s revisit a version of our graphs from the end of last lecture.
We’ll use sunshineFinal.df to begin exploring a slightly
bigger picture for our data, looking for trends that spanned over all
the years of our datasets.
In our first lecture we looked at another helpful visualization that barplots can produce: composition and proportions. Here we combine the observations for each sector to help convey an added dimension to our data.
The key to our following figure is in the aes() layer
where we set:
x: the x-axis values here are usually categorical or
discrete in nature
fill: by nature our bars will stack by some
categorical variable in our data
You’ll notice, however, that we do not set a y-axis
value. This is because geom_bar(), by default, counts the
observations of each subsection within each bar.
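That default counting behaviour is the same idea as base R’s table(); a toy sketch (invented years, not our dataset) shows what geom_bar() computes for each x position:

```r
# geom_bar() counts rows per x value, much like table() (toy data)
df <- data.frame(year = c(1996, 1996, 2023, 2023, 2023))
counts <- table(df$year)
counts  # two bars: height 2 at 1996, height 3 at 2023
```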
# 3.3.0 simple bar chart of sector size vs year
sunshineFinal.df %>%
# Filter out the "Ministry" datapoints
filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE)) %>%
# Move our reordering of sectors into the data wrangling
mutate(sector = reorder(x = sector, X = -salary, FUN = sum)) %>%
# 1. Data
ggplot() +
# 2. Aesthetics
aes(x = year, fill = sector) +
theme(text = element_text(size = 20)) + # set text size
theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
# 4. Geoms
# Start with a simple bar graph
geom_bar()
The position parameter of geom_bar()

Taking a quick look at the overall result, there’s a lot of information here. From this graph, we can quickly see just how many salaries have climbed above $100K over nearly 3 decades: from fewer than 10K individuals in 1996 to nearly 300K in 2023. The problem with our figure, however, is that because it has scaled so drastically, we cannot pull much meaningful subgroup proportion information from 1996 compared with what we can see more distinctly in 2023.
Instead, let’s take advantage of the geom_bar() parameter position, which can be set via layer position adjustments:
position_stack(): this is the default and what we see above, where each fill category is stacked one upon another
position_dodge(): this separates each fill category into its own separate small bar
position_fill(): this gives proportions of the whole for each subgroup at every x-position on the figure, essentially normalizing each x-category within itself
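The normalization that position_fill() performs is the same as base R’s prop.table() applied column-wise. A sketch with hypothetical counts (two sectors, two years) makes the idea concrete:

```r
# Hypothetical counts: rows = sector, columns = year (the x position)
counts <- matrix(c(10, 300, 5, 150), nrow = 2,
                 dimnames = list(sector = c("Judiciary", "School Boards"),
                                 year   = c("1996", "2023")))

# position_fill() effectively normalizes each x position (column) to sum to 1
props <- prop.table(counts, margin = 2)
colSums(props)  # each year column now sums to 1
```

This is why wildly different yearly totals become directly comparable: every bar is stretched to the same full height.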
Let’s give position_fill() a try and see how that
updates our figure
# 3.3.1 simple bar chart of sector size vs year
sunshineFinal.df %>%
# Filter out the "Ministry" datapoints
filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE)) %>%
# Move our reordering of sectors into the data wrangling
mutate(sector = reorder(x = sector, X = -salary, FUN = sum)) %>%
# 1. Data
ggplot() +
# 2. Aesthetics
aes(x = year, fill = sector) +
theme(text = element_text(size = 20)) + # set text size
theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
# 4. Geoms
### 3.3.1 present each sector as a proportion of the whole for each year on the x-axis
geom_bar(position = position_fill())
What a difference that single parameter can make. Now, we can see more interesting details like:
A major change to proportions in sector sizes between 2002 and 2003. This likely reflects a re-organization of the sectors overall: “Hospitals” pretty much disappears and “Hospitals and Boards of Public Health” appears to replace it. “Ontario Public Sector” is likely broken down into other sectors, as it also virtually disappears in 2003.
2021-2023 does see an uptick in the proportion of “School Boards” positions. From our current graph, however, we must recall that everything is calculated as a proportion, so what we see is on a relative scale.
xlim()

From above, to avoid some of the strange proportioning we get from 1996-2002, you have a couple of easy options for trimming your data:
Use the filter() layer to limit the dates you’d like
to see.
Change the x-axis limits using the xlim() layer. For
this, you simply need to supply the left and right limits as values.
Note that the order of your values does matter! You can achieve similar
control with the y-axis and the ylim() layer.
Filtering versus axis limits: When you filter() your data, you may be asking for observations in a specific range (e.g. 1-10) but only get back observations that exist as a subset within that range (e.g. 4-9). When you plot your data, depending on the data type, ggplot will only plot to the upper (and sometimes lower) range of your data. In our current example, you would expect to see an axis ranging 1-10 based on your filter, but ggplot might only produce 0-9 or even 4-9!
In the case of xlim() or ylim(), this sets a hard range on the data to be displayed, whether or not you have observations that fall outside of that range! Any data outside the specified range(s) will simply not be plotted, so you may miss unexpected data! However, in a series of figures, setting the same x- or y-axis limits can make your visualizations more consistent and uniform looking. Having the same y-axis limits, for instance, can also help your readers compare two experiments so the effect of your different conditions is much more visible.
Know your geoms, axis types and their limits: A quick caution here that, depending on the type of geom you are using and the data type of your axis, ggplot can inadvertently cut your data off. Due to the nature of barplots, for instance, if your x-axis is numeric, then each vertical bar is centred on a value and spans roughly +/- half its width. If your limit is set at the same value as a column, that column will not be rendered using default geom parameters.
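The clipping arithmetic is easy to verify. With the default geom_bar() width of 0.9, a bar at x = 2023 extends past 2023 on the right, so an xlim() upper bound of exactly 2023 cuts it off:

```r
# A bar at x = 2023 (numeric axis) with default width 0.9 spans:
x     <- 2023
width <- 0.9
span  <- c(x - width / 2, x + width / 2)
span  # 2022.55 to 2023.45: an xlim() upper bound of exactly 2023 clips it
```

This is why the code below extends the limit by one year on each side.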
# 3.3.2 trimmed bar chart of sector size vs year
sunshineFinal.df %>%
# Filter out the "Ministry" datapoints
filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE)) %>%
# Move our reordering of sectors into the data wrangling
mutate(sector = reorder(x = sector, X = -salary, FUN = sum)) %>%
# 1. Data
ggplot() +
# 2. Aesthetics
aes(x = year, fill = sector) +
theme(text = element_text(size = 20)) + # set text size
theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
# 3. Scaling
### 3.3.2 Limit our x-axis to between 2003 and 2023 (why do we add 1 on each side?)
xlim(2002, 2024) +
# 4. Geoms
# present each sector as a proportion of the whole for each year on the x-axis
geom_bar(position = position_fill())
## Warning: Removed 47499 rows containing non-finite outside the scale
## range (`stat_count()`).
## Warning: Removed 9 rows containing missing values or values outside
## the scale range (`geom_bar()`).
One last thing to keep in mind when working with your barplots is that you are not required to stack them. You can, in fact, split your subgroups into a side-by-side comparison; in this case, we could see each sector as its own bar for every year! This makes the most sense when you are working with smaller x-axis ranges and can give a different spin on how your plots look. To accomplish this, set the position parameter to position_dodge() or position_dodge2() instead. The latter is especially helpful when dodging geom_boxplot() layers, which may have variable widths. We’ll get more into that later!
At the same time, we’ll explore the width parameter
(default value = 0.9) which allows you to alter the thickness of your
bars. Sometimes this just allows you to add more space between your bars
if it feels overcrowded, but can also come in handy when dodging your
barplots. You can also use this parameter to make your bars overlap!
# 3.3.3 Dodging your barplots
sunshineFinal.df %>%
# Filter out the "Ministry" datapoints
filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE)) %>%
# Move our reordering of sectors into the data wrangling
mutate(sector = reorder(x = sector, X = -salary, FUN = sum)) %>%
# 1. Data
ggplot() +
# 2. Aesthetics
aes(x = year, fill = sector) +
theme(text = element_text(size = 20)) + # set text size
theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
# 3. Scaling
# Limit our x-axis to between 2002 and 2023
xlim(2002, 2023) +
# 4. Geoms
### 3.3.3 Set your position parameter
geom_bar(position = position_dodge(), width = 0.8)
## Warning: Removed 47499 rows containing non-finite outside the scale
## range (`stat_count()`).
## Warning: Removed 11 rows containing missing values or values outside
## the scale range (`geom_bar()`).
While we could probably also see some of these trends with a lineplot, we can now see that, year over year, each sector has been growing in the number of individuals over the $100K mark. The only exception we can quickly spot is the Judiciary, which has held fairly steady for many years, likely due to the fixed size of the court system and the number of judges available! We also see that School Boards have failed to grow over the last 3 years, although a deeper dive could reveal any number of factors: a steady state of retirements balancing out promotions and wage increases, a freeze on hiring concurrent with a wage freeze, or some combination of those possibilities!
Note how altering the width parameter doesn’t change the
distance between our mini-bars but it does alter the distance between
our groups by year!
coord_polar()

Generally speaking, there are a number of circular or polar coordinate plot variants ranging in complexity from the pie chart to a racetrack chart. The coord_polar() layer takes a few parameters:
theta: determines if angles will be mapped to the x or y variable
start: offset from the “12 o’clock” position
direction: 1 = clockwise, -1 = counter-clockwise
Here’s a summary:
| Chart name | Plot description | Uses | Cartesian analog |
|---|---|---|---|
| Pie chart | Sliced up proportions of a circle | Simple proportion comparison between groups | single bar variant |
| Nightingale/Coxcomb plot | Intervals are equal wedges but shaded areas represent proportions | Intervals with additional categorical proportions | Stacked column plot, theta=x |
| Race track plot | Intervals are split concentrically to form rings of equal width | Helpful if some groups are less diverse | Stacked column plot, theta=y |
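The theta mapping can be sketched numerically. With theta = "y", stacked y values are converted to angles in proportion to the total (the slice values below are hypothetical):

```r
# With coord_polar(theta = "y"), cumulative y values map to angles:
# angle = 2 * pi * cumulative_y / total (offset by `start`, sign by `direction`)
yvals  <- c(25, 50, 25)                 # three hypothetical slices
total  <- sum(yvals)
angles <- 2 * pi * cumsum(yvals) / total
angles / pi  # slices end at 0.5, 1.5, and 2 multiples of pi
```

The last slice always closes the circle at 2*pi, which is why proportions sum so naturally into a pie.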
From our pie chart description and from your own experience, you can recall that pie charts aren’t helpful with complex data. Once the number of categories surpasses 6-8 groups, things can get hard to visually digest - especially if one category dominates the others.
Let’s continue with sunshineFinal.df but focus in again
on 2023 to get a breakdown of sector salary total payout. We’ll group
and summarise our data again before passing it along to
ggplot(). The key here will be converting a
geom_bar() barplot to a pie chart using the
coord_polar() layer. The coord_polar() layer
will take an argument for the parameter theta, which can be set to the x- or y-axis.
To accomplish this feat it helps to think about how the data is being handled. A pie chart is much like a single stacked bar whose top and bottom have been curved around to meet each other, connecting the bar along the y-axis! This is a plot all about proportions: instead of a defined x-axis, we show the proportions for a single category or group.
Therefore, we set the aes() parameter x to
a value of "" so there are no categorical values for
x but you will still need to set a y value
from which to determine those proportions.
Note that we are introducing two common layers to help with titles in our figure: ggtitle() and guides(). We won’t discuss these in detail here, but they basically add titles to your plot and legend respectively.
We will also briefly introduce scale_fill_viridis_d() to our figures. This scale uses the viridis colour maps, a set of colour-blind friendly palettes that also remain accurate when converted to grey-scale!
# 3.4.1 Building a simple pie chart
sunshineFinal.df %>%
# Filter out the "Ministry" datapoints
filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE),
year == 2023) %>%
# Move our reordering of sectors into the data wrangling
mutate(sector = reorder(x = sector, X = -salary, FUN = sum)) %>%
# Group based on sector
group_by(sector) %>%
# Summarize to get total salary and population size
summarise(totalSalary = sum(salary),
sectorSize = n()) %>%
# 1. Data
ggplot() +
# 2. Aesthetics
aes(x = "", y = totalSalary, fill = sector) +
theme_void() + # This will remove all background theme objects
theme(text = element_text(size = 14)) + # set text size
guides(fill = guide_legend(title="Public Sector")) +
ggtitle("Proportion of total salaries paid by sector in 2023") +
# 3. Scaling
scale_fill_viridis_d() + # the "d" stands for discrete colour scale
# 4. Geoms
# Set up our barplot here
geom_bar(stat="identity",
colour = "white") + # Set the colour between segments
# 7 Coordinate system
### 3.4.1 Map angles to the Y-axis
coord_polar(theta="y")
Made famous by Florence Nightingale, these plots (which she named coxcomb plots) were used to help emphasize the proportions of soldier deaths due to infection during the Crimean War. Each circular increment is equally spaced to represent increasing values along the y-axis. Visually, this gives more area representation as we radiate outwards, producing a chart that places extra visual weight on the outer segments.
Florence Nightingale’s original rose chart/coxcomb plot publication. Sourced from the Wikimedia commons
Know your y-limits! Outer segments disproportionately represent increases in value: their larger area draws more visual emphasis despite the linear scale of the y-axis. Because of this disproportionate growth in area, you need to consider the range of your data.
Let’s extend our pie chart into a coxcomb chart where we’ll explore data from 2011 to 2023.
We’ll add in the xlab() and ylab() layers
so we can rename our x- and y-axis titles as well. You may also notice a
couple of extra calls to the theme() layer and we’ll spend
plenty of time next lecture discussing how to best uses these layers to
customize your plots!
# 3.4.2 Building our first coxcomb plot
sunshineFinal.df %>%
# Filter out the "Ministry" datapoints
filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE),
year %in% c(2011:2023),
as.integer(sector) %in% c(1:7)) %>%
# Move our reordering of sectors into the data wrangling
mutate(sector = reorder(x = sector, X = -salary, FUN = sum),
year = factor(year)) %>%
# Group based on sector
group_by(sector, year) %>%
# Summarize to get total Salary and population size
summarise(totalSalary = sum(salary),
sectorSize = n()) %>%
# 1. Data
ggplot() +
# 2. Aesthetics
aes(x = year, y = sectorSize, fill = sector) +
theme_bw() + # make the theme more plain
theme(text = element_text(size = 20)) + # set text size
theme(panel.grid.major.y = element_line(color="grey60")) + # darken our major y grid
guides(fill = guide_legend(title="Public sector")) +
xlab("Year") + # Set the x-axis label
ylab("Sector Size") + # Set the y-axis label
ggtitle("Sector size 2011-2023") +
# 3. Scaling
scale_fill_viridis_d() + # the "d" stands for discrete colour scale
# 4. Geoms
geom_col(width = 0.9) + # Squeeze our bars together
# 7 Coordinate system
### 3.4.2 Map angles to the x-axis
coord_polar(theta = "x")
## `summarise()` has grouped output by 'sector'. You can override using
## the `.groups` argument.
Looking at our Nightingale plot, we can see how much emphasis is put on larger portions within each stacked bar. Just be careful, because:
Too many x-axis categories can overcrowd your plot
Having a large range of values in your y-axis (especially between x-axis categories) can also make it very hard to read or distinguish smaller parts of your stack.
Want another cool chart from a boring bar chart? Stacked or otherwise, you can convert those bar plots to a racetrack or radial bar chart. The only difference between the coxcomb plot and the radial bar chart is that instead of the default theta="x" we set it to "y", much like flipping the coordinates of a bar chart.
Beware the racetrack transformation: This is an aesthetic transformation where each bar gets relatively longer as you move from the inside out. Values must therefore be judged by angle rather than by length! Our eyes more precisely judge differences on a Cartesian bar chart, so when viewing these types of charts, don’t be misled by a casual glance!
Let’s make a Cartesian version of our bar chart data and compare it to the racetrack plot. (Note that any negative values in a dataset are plotted separately; it is an implementation quirk of ggplot.)
# 3.4.3 Generate a coord-flipped cartesian plot of our last figure
sunshineFinal.df %>%
# Filter out the "Ministry" datapoints
filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE),
year %in% c(2011:2023),
as.integer(sector) %in% c(1:7)) %>%
# Move our reordering of sectors into the data wrangling
mutate(sector = reorder(x = sector, X = -salary, FUN = sum),
year = factor(year)) %>%
# Group based on sector
group_by(sector, year) %>%
# Summarize to get total Salary and population size
summarise(totalSalary = sum(salary),
sectorSize = n()) %>%
# 1. Data
ggplot() +
# 2. Aesthetics
aes(x = year, y = sectorSize, fill = sector) +
theme_bw() + # make the theme more plain
theme(text = element_text(size = 20)) + # set text size
theme(panel.grid.major.y = element_line(color="grey60")) + # darken our major y grid
guides(fill = guide_legend(title="Public sector")) +
xlab("Year") + # Set the x-axis label
ylab("Sector Size") + # Set the y-axis label
ggtitle("Sector size 2011-2023") +
# 3. Scaling
scale_fill_viridis_d() + # the "d" stands for discrete colour scale
### 3.4.3 Flip our bars to the y-axis
coord_flip() +
# 4. Geoms
geom_col(width = 0.9) # Squeeze our bars together
## `summarise()` has grouped output by 'sector'. You can override using
## the `.groups` argument.
Now just imagine what our figure would look like if we could connect the right side of the bars to the left side!
# 3.4.3 Switch up the plot to a racetrack plot
sunshineFinal.df %>%
# Filter out the "Ministry" datapoints
filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE),
year %in% c(2011:2023),
as.integer(sector) %in% c(1:7)) %>%
# Move our reordering of sectors into the data wrangling
mutate(sector = reorder(x = sector, X = -salary, FUN = sum),
year = factor(year)) %>%
# Group based on sector
group_by(sector, year) %>%
# Summarize to get total Salary and population size
summarise(totalSalary = sum(salary),
sectorSize = n()) %>%
# 1. Data
ggplot() +
# 2. Aesthetics
aes(x = year, y = sectorSize, fill = sector) +
theme_bw() + # make the theme more plain
theme(text = element_text(size = 20)) + # set text size
theme(panel.grid.major.y = element_line(color="grey60")) + # darken our major y grid
guides(fill = guide_legend(title="Public sector")) +
xlab("Year") + # Set the x-axis label
ylab("Sector Size") + # Set the y-axis label
ggtitle("Sector size 2011-2023") +
# 3. Scaling
scale_fill_viridis_d() + # the "d" stands for discrete colour scale
# 4. Geoms
geom_col(width = 0.9) + # Squeeze our bars together
# 7 Coordinate system
### 3.4.3 Map angles to the y-axis
coord_polar(theta="y")
## `summarise()` has grouped output by 'sector'. You can override using
## the `.groups` argument.
Often for our purposes as scientists, we are more concerned with the distribution of replicates or measurements as a population within a group whilst also comparing across other groups (ie control vs treatment). Naively, we are tempted by the convenience of Excel to simplify our data as a bar chart with some simple standard deviation or error bars. One word concisely describes this behaviour:
\[\text{Inappropriate}\]
Note from our examples that although we are comparing sectors, our datapoints represent single or total observations separated either by group or by time. At each x-axis point we are not comparing the distribution between categories in a meaningful way. In fact, from our first example, we can convey a near-exact message using a dot-plot! In essence these visualizations are communicating the categorization of samples across multiple groups. More or less, this is about visualizing proportions.
Section 3.0.0 Comprehension Question: After exploring the racetrack variant above, visualizing our data ranging from 2011 to 2023, what is the biggest issue you might see? What would be one way to remedy this? Is this an appropriate visualization of our data?
# Code your plot here!
What is the shape of our data? When we encounter experimental populations and begin sampling or measuring for specific characteristics, we inevitably begin to reveal whether or not that characteristic has a predictable range, median value, etc. Density plots, once given enough sample values, begin to predict the shape or distribution of that sampling. We sometimes use this to represent the theoretical distribution of our original populations.
In contrast to our previous datasets, we are interested in dissecting group characteristics after sampling multiple times. In this case, we have an interesting dataset for comparison because we can look at the distribution of sunshine salaries across each sector in our data. Are some sectors paid more or less or have a wider distribution of salaries?
We have many observations we can compare in each group (some sectors of course are larger than others) but these datapoints can give us a window into how our public servants are paid.
While bar plots help to focus on proportional representation for categorical data, both density plots and histograms can be used to convey the frequency or distribution of values for a variable across its specified range. When comparing distributions visualized this way, you can usually compare up to 3 or 4 on the same plot before the data becomes too crowded. These methods also give you a sense of your data before moving forward with more detailed analyses. You can also identify possible mistakes or artefacts in data collection.
How much data do I need? Density plots can be thought of as smoothed versions of histograms which have been binned in small intervals. Density plots of a single dimension require a minimum of 4 samples but justifying a KDE on a sample size that small is hard. Histograms are recommended to have at least 30 observations to be considered useful and I would apply this rule of thumb to KDEs as well.
Let’s pick a few sectors to plot based on salary. We’ll
re-filter our data to look at just a portion of sectors much like our
nightingale plots for just the year of 2023. In this figure we’ll
introduce 2 geoms:
geom_density(): this is the layer we can use to
create a kernel density estimate. In theory, the area under each curve
should be 1 as they represent the probability of any salary value within
a range of the data.
geom_rug(): this simple geom will sit outside our
plot in what would be considered the “margins” of the plot. Each
vertical line will represent an actual observation from the
data.
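The estimate behind geom_density() can be sketched with base R’s density() function. Here simulated log-normal “salaries” stand in for the real data, and we can check the density property the text describes — the area under the curve is approximately 1:

```r
# Toy salary draws standing in for one sector's data (not the real dataset)
set.seed(1)
salaries <- rlnorm(500, meanlog = log(110000), sdlog = 0.25)

# Kernel density estimate over an equally spaced grid of x values
d <- density(salaries)

# As a probability density, the area under the curve is ~1 (Riemann sum)
area <- sum(d$y) * diff(d$x)[1]
area
```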
# 4.1.0 Generate a kernel density estimate with a rug
sunshineFinal.df %>%
# Filter out the "Ministry" datapoints
filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE),
year %in% c(2023),
sector %in% c("Colleges", "Universities", "Judiciary", "Hospitals And Boards Of Public Health")) %>%
# 1. Data
ggplot() +
# 2. Aesthetics
aes(x = salary, fill = sector) +
theme_bw() + # make the theme more plain
theme(text = element_text(size = 20)) + # set text size
theme(panel.grid.major.y = element_line(color="grey60")) + # darken our major y grid
guides(fill = guide_legend(title="Public sector")) +
xlab("Salary") + # Set the x-axis label
ylab("Density") + # Set the y-axis label
ggtitle("Sector pay distribution for 2023") +
# 3. Scaling
scale_fill_viridis_d() + # the "d" stands for discrete colour scale
scale_colour_viridis_d() +
xlim(50000, 500000) +
# 4. Geoms
### 4.1.0 Add a density plot
geom_density(alpha = 0.5) + # Plot the density estimates
### 4.1.0 Add a rug plot to the x-axis margin
geom_rug(aes(colour = sector))
## Warning: Removed 66 rows containing non-finite outside the scale range
## (`stat_density()`).
Looking at our 4 chosen sectors, we see some interesting density plots.
Colleges have a very tight distribution with a very steep drop-off around $150K
Universities, on the other hand, have a much longer tail leading outward, suggesting a higher salary range, but also a more even distribution along that range. However, the highest peak in density sits at a smaller salary than that of the Colleges group.
Hospitals and Boards of Public Health also have a similar mode for salary as Universities but with a much tighter range for the bulk of its distribution.
Judiciary is our most interesting set of data as it represents a bimodal distribution. There are two distinct groupings of salaries in the Judiciary system representing a “lower” range in the $180K range and a “higher” group in the $350K range.
The after_stat() function to scale your data

There are a number of after_*() functions that can be applied to your plots. The overall purpose of these is to influence the aesthetic parameters (i.e. x, y, fill, colour, etc.) based on values computed after a geom’s statistical transformation of your data. For instance, we know that geom_bar() counts up observations for each category on the x-axis. You can use after_stat(count) to extract the total observations per x-axis category as a computed variable.
Looking carefully at the documentation of each geom_*(),
you will find which computed variables are calculated by each specific
geom. For geom_density() we can extract
density, number of points, and a scaled density estimate.
Let’s look at using after_stat(scaled) to set our y-axis
range of values in our previous geom to between 0 and 1.
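What after_stat(scaled) computes can be sketched directly: it divides each density curve by its own maximum, so every curve peaks at exactly 1 (toy normal data below, not our salaries):

```r
# Mimic after_stat(scaled): divide a density curve by its own maximum
set.seed(42)
d      <- density(rnorm(200))  # toy data
scaled <- d$y / max(d$y)
max(scaled)  # 1
```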
# 4.1.1 normalize or scale our KDEs
sunshineFinal.df %>%
# Filter out the "Ministry" datapoints
filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE),
year %in% c(2023),
sector %in% c("Colleges", "Universities", "Judiciary", "Hospitals And Boards Of Public Health")) %>%
# 1. Data
ggplot() +
# 2. Aesthetics
aes(x = salary, fill = sector) +
theme_bw() + # make the theme more plain
theme(text = element_text(size = 20)) + # set text size
theme(panel.grid.major.y = element_line(color="grey60")) + # darken our major y grid
guides(fill = guide_legend(title="Public sector")) +
xlab("Salary") + # Set the x-axis label
ylab("Scaled density") + # Set the y-axis label
ggtitle("Sector pay distribution for 2023") +
# 3. Scaling
scale_fill_viridis_d() + # the "d" stands for discrete fill scale
scale_colour_viridis_d() + # the "d" stands for discrete colour scale
xlim(50000, 500000) +
# 4. Geoms
### 4.1.1 scale the KDE plot
geom_density(aes(y = after_stat(scaled)), alpha = 0.5) + # Plot the density estimates
### Add a rug plot to the x-axis margin
geom_rug(aes(colour = sector))
## Warning: Removed 66 rows containing non-finite outside the scale range
## (`stat_density()`).
By setting the after_stat() option in our density plot, all of our densities now have a maximum height of 1. This makes it easier to compare all of our densities at the same height, which is especially helpful when some are very skewed or have very long tails that would otherwise leave them much “lower” without scaling. Now you can more easily compare the shapes of the distributions at any point along the x-axis.
facet_*() to view KDEs separately
As you can see from above, if we had more and more sectors, it may be better to separate them out in order to avoid too much overlap. It may also be to our advantage to compare them in a more vertical fashion. Recall that there are two layers we can choose from: facet_wrap() and facet_grid(). They are differentiated by the following characteristics:
facet_wrap(): Data is split based on available data combinations of the facets parameter, which can be defined by a formula like ~var1+var2 or by using the quoting function vars(var1, var2). In either case, the data is then grouped by your variables. Panels are placed in a single ribbon that wraps around based on the arguments in the ncol and nrow parameters.
facet_grid(): Data is split based on the 1- or 2-dimensional combination of facet variables. Even combinations for which there is no data will be displayed as empty panels. Use the rows and cols parameters to set the facets by using the vars() quoting function. Note that you could further group your rows and/or columns by multiple variables.
Note we’ll also use another parameter, available in both, called scales, where we can determine if panels should share axis scales or scale them based on individual panel data in the y-axis (free_y), x-axis (free_x), or both (free).
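A minimal sketch of the difference, using a made-up toy data frame (the variable names g and h are hypothetical):

```r
library(ggplot2)

# Toy data: two grouping variables where the combination g = "b", h = "y"
# never occurs
toy.df <- data.frame(
  x = rnorm(30),
  g = rep(c("a", "a", "b"), each = 10),
  h = rep(c("x", "y", "x"), each = 10)
)

# facet_wrap(): only combinations present in the data become panels,
# laid out in a ribbon controlled by ncol/nrow
pWrap <- ggplot(toy.df, aes(x = x)) +
  geom_density() +
  facet_wrap(~ g + h, ncol = 2)

# facet_grid(): the full g-by-h grid is drawn, including the empty
# ("b", "y") panel; scales = "free_y" lets each row set its own y range
pGrid <- ggplot(toy.df, aes(x = x)) +
  geom_density() +
  facet_grid(rows = vars(g), cols = vars(h), scales = "free_y")
```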
Let’s first use facet_wrap() to generate a single-column facet of KDEs based on 2023 sector pay from our dataset.
# 4.1.2 Switch up the plot to a faceted KDE plot
sunshineFinal.df %>%
# Filter out the "Ministry" datapoints
filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE),
year %in% c(2023)) %>%
# 1. Data
ggplot() +
# 2. Aesthetics
aes(x = salary, fill = sector) +
theme_bw() + # make the theme more plain
theme(text = element_text(size = 20)) + # set text size
theme(panel.grid.major.y = element_line(color="grey60")) + # darken our major y grid
guides(fill = guide_legend(title="Public sector")) +
xlab("Salary") + # Set the x-axis label
ylab("Density") + # Set the y-axis label
ggtitle("Sector pay distribution for 2023") +
# 3. Scaling
scale_fill_viridis_d() + # the "d" stands for discrete fill scale
xlim(50000, 400000) +
# 4. Geoms
### 4.1.2 Note how we no longer need to scale our y-axis?
geom_density(alpha = 0.5) + # Plot the density estimates
# 6. Facets
### 4.1.2 Add a simple facet wrap, with each sector group existing in its own row
facet_wrap(~sector,
scales="free_y",
ncol=1)
## Warning: Removed 381 rows containing non-finite outside the scale
## range (`stat_density()`).
facet_grid() to view multiple periods of our data
If, for example, we wanted to directly compare the data from two years (2015 and 2023), we could alter the above plot so that our two years are paneled beside each other. This can be accomplished with the facet_grid() layer, which allows us to name both a rows and a cols grouping parameter.
Let’s update our figure so that we split our rows by sector and our columns by year.
# 4.1.3 Switch up the plot to a faceted KDE grid
sunshineFinal.df %>%
# Filter out the "Ministry" datapoints
filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE),
year %in% c(2015, 2023)) %>%
# 1. Data
ggplot() +
# 2. Aesthetics
aes(x = salary, fill = sector) +
theme_bw() + # make the theme more plain
theme(text = element_text(size = 20)) + # set text size
theme(panel.grid.major.y = element_line(color="grey60")) + # darken our major y grid
guides(fill = guide_legend(title="Public sector")) +
xlab("Salary") + # Set the x-axis label
ylab("Density") + # Set the y-axis label
ggtitle("Sector pay distribution 2015 vs. 2023") +
# 3. Scaling
scale_fill_viridis_d() + # the "d" stands for discrete fill scale
xlim(50000, 400000) +
# 4. Geoms
### 4.1.3 Note how we no longer need to scale our y-axis?
geom_density(alpha = 0.5) + # Plot the density estimates
# 6. Facets
### 4.1.3 Add a simple facet grid, with each sector group existing in its own row
# and two columns of year data
facet_grid(rows = vars(sector),
cols = vars(year),
scales="free")
## Warning: Removed 575 rows containing non-finite outside the scale
## range (`stat_density()`).
While the above visualization is pretty clear, it certainly takes up a lot of extra space. Perhaps there is a better way to generate this kind of visualization?
Ridgeline plots (sometimes called joyplots) offer a compact way to show multiple distributions of numeric values across several groups. The distributions can be shown as either histograms or density plots, all aligned to the same horizontal axis. The vertical axis is compressed slightly to generate an overlap between categories.
To generate these visualizations we can use the ggridges package, which is an extension of ggplot2. That means it uses the same grammar, and the ridgelines can be added as a layer with geom_density_ridges(). A parameter to keep in mind:
scale: sets the vertical distance between ridgelines
1: the tallest density curve just touches the baseline of the next vertical category
Above 1: increasing overlap
Below 1: increasing separation
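A minimal sketch of the scale parameter on made-up data (assuming the ggridges package is installed):

```r
library(ggplot2)
library(ggridges)  # supplies geom_density_ridges()

# Toy data: three groups of values sharing one horizontal axis
toy.df <- data.frame(
  value = c(rnorm(100, 0), rnorm(100, 2), rnorm(100, 4)),
  group = rep(c("A", "B", "C"), each = 100)
)

# scale = 1: the tallest ridge just touches the next baseline
pTouch <- ggplot(toy.df, aes(x = value, y = group)) +
  geom_density_ridges(scale = 1)

# scale > 1 increases overlap; scale < 1 increases separation
pOverlap <- ggplot(toy.df, aes(x = value, y = group)) +
  geom_density_ridges(scale = 2)
```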
Let’s begin by replicating our faceted plot from above which compares the 2015 vs 2023 salary distributions across sectors.
# 4.2.0 Generate our ridgeline plot to compress our KDEs
library(ggridges) # needed for geom_density_ridges()
sunshineFinal.df %>%
# Filter out the "Ministry" datapoints
filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE),
year %in% c(2015, 2023)) %>%
# sector %in% c("Colleges", "Universities", "Judiciary", "Hospitals And Boards Of Public Health")) %>%
# 1. Data
ggplot() +
# 2. Aesthetics
aes(x = salary, y = sector, fill = sector) +
theme_bw() + # make the theme more plain
theme(text = element_text(size = 20)) + # set text size
theme(panel.grid.major.y = element_line(color="grey60")) + # darken our major y grid
theme(legend.position = "none") + # drop the legend since it's redundant
guides(fill = guide_legend(title="Public sector")) +
xlab("Salary") + # Set the x-axis label
ylab("Density") + # Set the y-axis label
ggtitle("Sector pay distribution 2015 vs. 2023") +
# 3. Scaling
scale_fill_viridis_d() + # the "d" stands for discrete fill scale
xlim(50000, 400000) +
# 4. Geoms
### 4.2.0 Note how we no longer need to scale our y-axis? Instead we can scale for overlap
geom_density_ridges(scale = 1.5,
alpha = 0.5) +
# 6. Facets
### 4.2.0 Add a facet grid with a column for each year
facet_grid(~year,
scales="free_x")
geom_density_ridges_gradient() to fill densities on a gradient
From above you can now see that we’ve compacted all that dimensional data into a single plot that still clearly conveys the difference in overall salary distribution within sectors. The distributions across our categories suggest that the sectors whose distributions changed the most over the past decade or so are the Colleges and School Board sectors.
For our audience, we would need to clean up our axis titles to clarify that these proportions are calculated independently. An additional option with your ridgeline plots is the fill variant. Right now we are using scale_fill_viridis_d() to generate a discrete colour fill for each of our density plots. To accomplish a nicer horizontal gradient we will change the layer call to scale_fill_viridis_c() since our x-axis is continuous. Keep in mind that we cannot set the alpha transparency on our density plots when filling with a gradient. We also have to set our fill aesthetic to after_stat(x) to accomplish this feat.
# 4.3.0 Generate our ridgeline plot with a horizontal gradient
library(ggridges) # needed for geom_density_ridges_gradient()
sunshineFinal.df %>%
# Filter out the "Ministry" datapoints
filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE),
year %in% c(2015, 2023)) %>%
# sector %in% c("Colleges", "Universities", "Judiciary", "Hospitals And Boards Of Public Health")) %>%
# 1. Data
ggplot() +
# 2. Aesthetics
aes(x = salary, y = sector, fill = after_stat(x)) +
theme_bw() + # make the theme more plain
theme(text = element_text(size = 20)) + # set text size
theme(panel.grid.major.y = element_line(color="grey60")) + # darken our major y grid
theme(legend.position = "none") + # drop the legend since it's redundant
guides(fill = guide_legend(title="Public sector")) +
xlab("Salary") + # Set the x-axis label
ylab("Density") + # Set the y-axis label
ggtitle("Sector pay distribution 2015 vs. 2023") +
# 3. Scaling
### 4.3.0 Use the Magma option
scale_fill_viridis_c(name="Salary", option = "C") +
xlim(50000, 400000) +
# 4. Geoms
### 4.3.0 Switch our geom to the gradient version
geom_density_ridges_gradient() + # Plot the density estimates
# 6. Facets
# Add a simple facet wrap, with each sector group existing in its own row
facet_grid(~year,
scales="free_x")
By altering how we fill our density estimates, we emphasize the x-axis range: cooler colours for lower salaries and “hotter” colours for the upper salary ranges. What this really highlights at a quick glance is the Judiciary sector. The salaries remain bimodal, but they have clearly shifted over the decade by ~$50K. This happens for both groups in the distribution, while we see a slightly smaller shift in our other sectors.
Of course there are other ways that you could compare these two sets of distributions as density plots. How would you approach overlapping salary distributions for different years atop each other?
Did you know the boxplot is nearly 50 years old? Invented in the 1970s by our favourite statistician, John Wilder Tukey, we’ll dig into how and when to use this iconic plot. While we’re here we’ll also take a look at other categorical distribution plots. While our KDE and ridgeline plots provide quite a bit of detail, they can also be limited in their space efficiency. The following categorical distribution plots may provide more information per unit of space.
geom_boxplot()
As you can see from the previous section, we comfortably fit quite a few distributions on a ridgeline plot. From the looks of it, school boards had one of the tightest distributions while the Judiciary sector had a strange bimodal split amongst its individual salaries.
Can we visualize the data in a more summarized form? Let’s explore the box plot.
The dissection of a boxplot’s components shows us how it summarizes a data distribution.
Also known as the box and whisker plot, this visualization conveys the distribution of samples within a group or population and is built upon 5 principal values:
“Box”
median: dark line across the centre of the box
lower quartile: lower bound of the box (aka lower hinge)
upper quartile: upper bound of the box (aka upper hinge)
“Whiskers”
lower whisker: extends to the smallest observation within 1.5 IQR below the lower quartile
upper whisker: extends to the largest observation within 1.5 IQR above the upper quartile
Together, the lower and upper quartiles bound the interquartile range (IQR). The general implementation of boxplots classifies any observation more than 1.5 IQR above the upper quartile or below the lower quartile as an outlier of the distribution. The characteristics of outliers can be set as parameters within the geom_boxplot() layer. Parameters include outlier.shape, outlier.size, and outlier.colour.
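A minimal sketch of those outlier parameters on a made-up dataset with a few deliberately extreme values:

```r
library(ggplot2)

# Toy data: group "B" carries some extreme values so outliers are produced
set.seed(1)
toy.df <- data.frame(
  group = rep(c("A", "B"), each = 50),
  value = c(rnorm(50), c(rnorm(45), 8, 9, 10, 11, 12))
)

# Style only the outlier points; the box and whiskers are untouched
p <- ggplot(toy.df, aes(x = group, y = value)) +
  geom_boxplot(outlier.shape  = 17,    # triangles instead of circles
               outlier.size   = 3,
               outlier.colour = "red")
```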
Unlike a histogram, the minimum number of values needed to generate a boxplot is 5. While you could generate a boxplot from fewer values, you might not have actual whiskers! This makes it a great alternative when sample sizes are between 5 and 30 per population.
Historically this was a simple way to visualize summary statistics of populations while being easy to produce by hand. Of course, with the age of computing, the production of kernel density estimates has allowed for more diverse visualizations. This plot, however, remains a popular format and thus is more readily understood by general audiences.
Each compact box can take up the same space as a barplot column but
it gives much more information about the population. Let’s look at a
single aspect - salary in the year of 2023 across our usual
sectors.
# 5.1.0 Generate a boxplot of 2023 salaries by sector
sunshineFinal.df %>%
# Filter out the "Ministry" datapoints
filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE),
year %in% c(2023)) %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x=sector, y = salary) +
theme(text = element_text(size = 20)) + # set text size
theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
# 3. Scaling
ylim(0, 1000000) +
# 4. Geoms
### 5.1.0 Add the boxplot geom
geom_boxplot()
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
From the looks of it, we can confirm what we saw before in our density plot - that most of the sectors have a pretty tight distribution with longer tails trailing to the higher salaries. Unsurprisingly, Judiciary again shows a different pattern with a much longer boxplot and no outliers reported. We’ll see why in a few sections.
A boxplot can also be updated with a few extra items to help us visualize the data:
add data points with geom_jitter() and remove
outliers from the boxplot to avoid double-plotting points.
mark a ~95% confidence interval within the shape of each box plot
with the notch parameter.
you can set a variable width for each boxplot that is based on the number of samples used to generate the plot. The wider the box, the more samples were used to generate it.
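The three additions above can be sketched together on toy data (the group sizes are made unequal so the variable widths are visible):

```r
library(ggplot2)

# Toy data: group "B" has four times as many observations as group "A"
set.seed(42)
toy.df <- data.frame(
  group = rep(c("A", "B"), times = c(30, 120)),
  value = c(rnorm(30, 5), rnorm(120, 6))
)

p <- ggplot(toy.df, aes(x = group, y = value)) +
  geom_boxplot(outlier.shape = NA, # hide outliers so jitter doesn't double-plot them
               notch = TRUE,       # ~95% confidence interval around the median
               varwidth = TRUE) +  # box width reflects the sample size
  geom_jitter(width = 0.2, alpha = 0.4)
```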
One thing you have to consider, however, is that when you have too many data points, plotting them can be very dense. Too many points and you can’t really discern any additional information outside of the original boxplot!
For our next example, we’ll reel in the number of observations by focusing on salaries in the University sector way back in 1996. We’ll plot salaries by employer in this case and see how the different institutions fared against each other ~30 years ago.
# 5.2.0 Generate a boxplot of university salaries from 1996
sunshineFinal.df %>%
# Filter for University data from 1996
filter(sector == "Universities", year == 1996) %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x=employer, y = salary) +
# Set some configurations for the plot
theme(text = element_text(size = 20)) + # set text size
theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
theme(legend.position = "none") + # drop the legend since it's redundant
# 3. Scaling
ylim(0, 300000) +
# 4. Geoms
geom_boxplot(outlier.shape=NA, ### 5.2.0 Remove outliers
notch = TRUE) + ### 5.2.0 Add a confidence interval notch
### 5.2.0 Add datapoints
geom_jitter(aes(colour = employer), alpha = 0.5)
## Notch went outside hinges
## i Do you want `notch = FALSE`?
geom_beeswarm() takes plotting your points up to the next level
As you can see from above, the geom_jitter() layer adds points to our boxplot by plotting them such that they avoid overlapping as much as possible. Points are restricted to the width of the boxplot, although this can be adjusted to some degree with the right parameters. geom_jitter() is native to the ggplot2 package, with parameters that allow for a more “random” distribution of your data points within a provided area.
The goal of the ggbeeswarm package is to generate points that will not overlap but can also simultaneously suggest the kernel density of your data. There are two geoms supplied that work with the ggplot2 package to accomplish this. The first is geom_beeswarm(), which has a number of parameters that can be used to set its aes() mappings but also how the points are laid out:
priority: determines the method used to perform the point layout.
groupOnX: if TRUE then jitter is added to the x-axis (default behaviour), otherwise jitter is added along the y-axis.
dodge.width: the amount by which different aesthetic groups (must be a factor) will be dodged.
show.legend: determines whether this layer should be included in the legends. NA (yes if aesthetics are mapped), FALSE (never include), and TRUE (always include).
cex: an optional parameter that can be used to help set the spacing between values.
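A minimal sketch of some of these parameters on made-up data (assuming the ggbeeswarm package is installed; the subgroup variable is hypothetical):

```r
library(ggplot2)
library(ggbeeswarm)  # supplies geom_beeswarm() and geom_quasirandom()

# Toy data: two groups, each split into two subgroups
set.seed(7)
toy.df <- data.frame(
  group    = rep(c("A", "B"), each = 60),
  subgroup = rep(c("x", "y"), times = 60),
  value    = rnorm(120)
)

p <- ggplot(toy.df, aes(x = group, y = value, colour = subgroup)) +
  geom_beeswarm(cex = 2,            # widen the spacing between points
                dodge.width = 0.5,  # separate the subgroup swarms
                show.legend = TRUE) # always include this layer in the legend
```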
You may also notice a number of warnings for our plot that our notches went outside the hinges. This means our notches and confidence intervals extend beyond one of the upper or lower quartiles. This problem is likely due to a wide distribution with only a few data points and can result in a strange looking box. Let’s turn it back off for now.
# 5.3.0 Generate a boxplot of university salaries from 1996 with geom_beeswarm
sunshineFinal.df %>%
# Filter for University data from 1996
filter(sector == "Universities", year == 1996) %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x=employer, y = salary) +
# Set some configurations for the plot
theme(text = element_text(size = 20)) + # set text size
theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
theme(legend.position = "none") + # drop the legend since it's redundant
# 3. Scaling
ylim(0, 300000) +
# 4. Geoms
geom_boxplot(outlier.shape=NA, # Remove outliers
notch = FALSE) + ### 5.3.0 Remove the confidence interval notch
### 5.3.0 Add datapoints as a beeswarm
geom_beeswarm()
geom_quasirandom() adds KDE structure to your plotting
As we can see from above, the geom_beeswarm() layer adds a little more structure to the data in a somewhat rigid way. Any datapoints that are near each other are deliberately spaced out to almost represent the distribution of your data. Of course, you may run into some issues as your number of datapoints or your data range increases. We see this problem very clearly in our previous plot.
To remedy this, we can balance the visualization a bit with the
geom_quasirandom() function.
geom_quasirandom() works similarly to the
geom_beeswarm() function with emphasis on an additional
method of how the points are plotted:
method: determines the method for distributing the points. Options include "quasirandom" (the default), "pseudorandom", "smiley", and "frowney".
varwidth: if TRUE, vary the width of each group based on the number of samples in the group
# 5.3.1 Generate a boxplot of university salaries from 1996 with geom_quasirandom
sunshineFinal.df %>%
# Filter for University data from 1996
filter(sector == "Universities", year == 1996) %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x=employer, y = salary) +
# Set some configurations for the plot
theme(text = element_text(size = 20)) + # set text size
theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
theme(legend.position = "none") + # drop the legend since it's redundant
# 3. Scaling
ylim(0, 300000) +
# 4. Geoms
geom_boxplot(outlier.shape=NA, # Remove outliers
notch = FALSE) + # Remove the confidence interval notch
### 5.3.1 Add datapoints as a quasirandom beeswarm
geom_quasirandom(width = 0.3, alpha = 0.2)
Now our data points take a more nuanced approach with a uniform width that shapes the data as a distribution. We’ll see this with even more emphasis in a few sections.
When given the need to show data distribution, try the quasirandom plotting of points over a simple beeswarm.
Is there more to the data we’ve visualized? For our dataset, a logical question to ask next would be how do these salaries compare across different years. We’ve done the same previously with our KDE plots so let’s try it with boxplots.
The kind of boxplot we will create is known as a nested boxplot. Technically speaking, up to this point, our multi-boxplot visualizations have been “grouped” boxplots because we are exploring a single variable (salary) across multiple groups within another variable (University employers). Now we will further nest within the employer variable by looking at a secondary variable (year) to break each employer into subgroups. We could do this with 2 or more years, but we’ll use just 2 for now.
n_distinct() to find the number of unique values in a group
When we compare 1996 to 2023 data, we need to keep in mind that a few things can happen. Universities can come and go or get amalgamated with other universities. In some cases, names can change too! To avoid these issues, we’ll do a simple series of filtering steps to look at Universities whose names are unchanged from 1996 to 2023.
First, we’ll filter our data to capture 1996 and 2023 data from Universities. We’ll group_by() and summarise() using a new function, n_distinct(), to count the number of years in the group. It will be either 1 or 2, and we only want to keep Universities with both 1996 and 2023 data. We’ll wrap it in a conditional to create a logical variable called keep in our summary. From this, we can create our curated list of Universities.
# 5.4.1 Make a curated list of Universities from both 1996 and 2023
universityList <-
sunshineFinal.df %>%
filter(sector == "Universities",
year %in% c(1996, 2023)) %>%
# Group by employer (ie University)
group_by(employer) %>%
# Are there observations for both years?
summarise(keep = n_distinct(year)>1) %>%
# Keep only the observations with 2 years
filter(keep == TRUE) %>%
# Grab the list of employers
pull(employer)
# Show the list
universityList
## [1] "Brock University" "Carleton University"
## [3] "Lakehead University" "Mcmaster University"
## [5] "Nipissing University" "Trent University"
## [7] "University Of Guelph" "University Of Toronto"
## [9] "University Of Waterloo" "University Of Windsor"
## [11] "Wilfrid Laurier University" "York University"
So it looks like we’ve narrowed our list of University employers to a set of twelve.
fill parameter
From there we’ll use the fill parameter to generate different subgroups in our boxplot to make a grouped boxplot. In doing so, we’ll also make use of the geom_quasirandom() parameters:
dodge.width: separate the data points by any aesthetic groups that have been assigned.
width: set the maximum spread of your data points within each grouping.
alpha: set the opacity of your data points so you can see more of the overlapping data.
Remember how we said to avoid plotting all points when you have too many? In this case, we can plot all of our points in a cleaner way than geom_jitter() by using geom_quasirandom() to confine all our datapoints to a smaller area.
# 5.4.2 Generate a grouped and faceted boxplot
sunshineFinal.df %>%
# Filter for University data from 1996
filter(sector %in% c("Universities"),
year %in% c(1996, 2023),
employer %in% universityList) %>% ### 5.4.2 filter on our list of employers
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x=employer,
y = salary,
fill = factor(year)) + ### 5.4.2 Set the fill colour based on year!
# Set some configurations for the plot
theme(text = element_text(size = 20)) + # set text size
theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
# 3. Scaling
ylim(0, 500000) +
# 4. Geoms
geom_boxplot(outlier.shape=NA,
notch = FALSE) +
### 5.4.2 Add datapoints as a quasirandom beeswarm
geom_quasirandom(dodge.width = 0.7, # How far apart to set our groups
width = 0.15, # How wide to make each beeswarm
alpha = 0.05) # Set the opacity very low since we expect a lot of points!
## Warning: Removed 19 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 19 rows containing missing values or values outside
## the scale range (`position_quasirandom()`).
In our boxplots, we are plotting the data in both a boxplot and dotplot format. The shape of the dots helps give a better sense of the salary distribution within each subgroup. You can see that the data points overlay the box but also fall outside it. Can we get the compact nature of the boxplot while keeping the visual appeal of a density plot?
As the title says, the violin plot is a mixture of both the boxplot and the kernel density plot. It is a miniaturized KDE mirrored across its axis. It encompasses the summary information of the boxplot but in a pear- or violin-shaped distribution. To generate a geom_violin() in ggplot, a minimum of three values is required. To justify using a violin, I would again suggest sticking to a similar rule of thumb of a minimum of ~30 samples/observations to ensure an accurate representation of the distribution.
The nuanced visualization of a violin plot gives much more information than the boxplot itself, and most boxplots can be replaced with a violin plot. Despite offering more detailed distribution information, this format remains less familiar to scientists, so its immediate accessibility to your audience can be limited.
An important parameter of this geom is scale: this sets
how big each violin is in comparison to each other. It accepts the
following values:
area: default value so all violins will have the
same area.
count: scale areas proportionately to
observations.
width: all violins have the same maximum width.
Total area is not the same for each.
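A minimal sketch contrasting the three scale values on toy data with deliberately unequal group sizes:

```r
library(ggplot2)

# Toy data: group "B" has five times as many observations as group "A"
set.seed(3)
toy.df <- data.frame(
  group = rep(c("A", "B"), times = c(40, 200)),
  value = c(rnorm(40), rnorm(200, 1))
)

# "area" (default): all violins share the same area
pArea  <- ggplot(toy.df, aes(x = group, y = value)) + geom_violin(scale = "area")
# "count": violin areas are proportional to the number of observations
pCount <- ggplot(toy.df, aes(x = group, y = value)) + geom_violin(scale = "count")
# "width": all violins share the same maximum width, not the same area
pWidth <- ggplot(toy.df, aes(x = group, y = value)) + geom_violin(scale = "width")
```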
Outliers and violins: Much like a KDE, the theoretical distribution of a violin plot can generate some impossible values - especially at the tails. Remember that this is a theoretical distribution based on the data supplied. Depending on how much variation is in your data, and how many outliers it has, it can really affect the shape of your violin.
# 5.5.0 Generate a violin plot
sunshineFinal.df %>%
# Filter for University data from 1996
filter(sector %in% c("Universities"),
year %in% c(1996, 2023),
employer %in% universityList) %>% # filter on our list of employers
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x=employer, y = salary, fill = factor(year)) +
# Set some configurations for the plot
theme(text = element_text(size = 20)) + # set text size
theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
# 3. Scaling
ylim(0, 500000) +
# 4. Geoms
### 5.5.0 Add the violin geom
geom_violin(scale = "width") +
# Add datapoints as a quasirandom beeswarm
geom_quasirandom(dodge.width = 0.85, # How far apart to set our groups
width = 0.15, # How wide to make each beeswarm
alpha = 0.05) # Set the opacity very low since we expect a lot of points!
## Warning: Removed 19 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Warning: Removed 19 rows containing missing values or values outside
## the scale range (`position_quasirandom()`).
Looking at our violin plot we can see just how closely the
geom_quasirandom() points match our violin shape. Depending
on the density of your observations, you could easily drop the data
points to make things a little cleaner.
The major advantage of the violin plot is that, by its nature, it is very sensitive to the distribution that produces the density estimate. The boxplot represents the summary information of a distribution, but its five values always draw the same simple shape. There are not enough parameters supplied to represent anything more complex!
The violin plot is not limited in that respect. Despite some of its visual caveats, it can certainly detect multi-modal data. Let’s make a toy example to illustrate using that Judiciary data we saw back in section 4.3.0.
# 5.6.0 Generate a combined violin and boxplot of bimodal data
sunshineFinal.df %>%
# Filter for Judiciary data
filter(sector %in% c("Judiciary"),
year %in% c(2008:2010, 2020:2023)) %>%
mutate(year = factor(year)) %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x=salary, y = year) +
# Set some configurations for the plot
theme_bw() +
theme(text = element_text(size = 20)) + # set text size
theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
# 3. Scaling
xlim(0, 500000) +
# 4. Geoms
geom_violin(scale = "width", colour="darkviolet") +
geom_boxplot(alpha = 0.6, colour = "blue") +
geom_quasirandom(alpha = 0.6, colour = "darkgreen")
## Orientation inferred to be along y-axis; override with
## `position_quasirandom(orientation = 'x')`
In our toy model with the Judiciary data, we can see that there really are 2 distinct populations. The datapoints and violins both mirror this. However, looking closely at the boxplot, we only see a long distribution of the data. Without plotting the datapoints, you would not know if the data is spread evenly over the boxplot’s area (remember it contains the middle 50% of the data) or if it is hiding a bimodal distribution.
Take-home message: while boxplots are great and easy to interpret on the surface, always ask to see the datapoints or a violin instead!
From our above example, you can see that we blended a number of geoms together. With a little work, we can also plot both violins and boxplots together in a multivariate setting! This gives us the familiarity of the boxplot but also clearly displays the theoretical distribution. Some steps to accomplish this:
Set the scale parameter to “width”.
Set the aes() parameters for our geoms directly to ensure our points are plotted correctly.
Drop the geom_quasirandom() layer since our violins mirror the same shape so well.
# 5.7.0 Generate a combination violin plot with boxplot
sunshineFinal.df %>%
# Filter for University data from 1996
filter(sector %in% c("Universities"),
year %in% c(1996, 2023),
employer %in% universityList) %>% # filter on our list of employers
mutate(year = factor(year)) %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x=employer, y = salary) +
# Set some configurations for the plot
theme(text = element_text(size = 20)) + # set text size
theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
# 3. Scaling
ylim(0, 500000) +
scale_colour_manual(values=c("black", "black")) + # we'll need this to fix our boxplots
# 4. Geoms
### 5.7.0 Add the violin geom
geom_violin(scale = "width", aes(fill = year)) +
### 5.7.0 Boxplot but smaller width so they reside "within" the violin plot.
# Do you think the order matters?
geom_boxplot(aes(colour = year), width=0.2,
position = position_dodge(width=0.9),
outlier.shape=NA) # Remove the outliers
## Warning: Removed 19 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Warning: Removed 19 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Section 5.7.0 Comprehension Question: Looking at the above code for our graph, why do you think we set the aes(fill/group = year) aesthetic individually for some layers rather than directly in the top aes() layer?
While the above visualization shows with some clarity the total distribution of salary across some of our universities, the messiness of outliers has crept into some of the violin plots themselves. An experienced eye, however, can still see that the bulk of each population centres around the internal boxplots, offset by the stretched-out look of the violins. We could, of course, clean things up by removing some of the outliers ahead of time, but I generally keep outliers as a visual reminder that experiments aren’t perfect!
Suppose, however, we wanted to add more sectors to our data, like “Colleges”, across a larger timespan of 15 years. With so many potential employers in each sector, things would start to get very crowded. To accommodate all of that data, you could facet it into multiple groups based on the sectors, for instance, but this would separate the datasets from each other. Sometimes things look sharper when you can see it all on a single plot.
Organizing multiple groups across multiple categories is the domain of the parallel coordinate plot. To accomplish this feat we’ll need to do a couple of key things:
Summarise each employer’s salaries so we plot a single value per year.
Set a group parameter made from multiple variables. Normally, the group parameter can only be set to a single variable (column) in your data. We can, however, make a new variable “on the fly” using the interaction() function. We can supply multiple variables from our dataset to help create this new interaction group. This is an extremely helpful way to subgroup your data, much like using a group_by() call within the ggplot() function.
# 5.8.0 Generate a parallel coordinate plot
sunshineFinal.df %>%
# Filter for University and College data from 2010 to 2023
filter(sector %in% c("Universities", "Colleges"),
year %in% c(2010:2023)) %>% # filter on our sectors and years
mutate(year = factor(year)) %>%
# Group the data
group_by(sector, employer, year) %>%
# Summarise to get mean salary for each
summarise(meanSalary = mean(salary)) %>%
# Revert back to a regular data frame
ungroup() %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x=year, y = meanSalary, colour = sector) +
# Set some configurations for the plot
theme(text = element_text(size = 20)) + # set text size
theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
# 3. Scaling
ylim(100000, 200000) +
# 4. Geoms
### 5.8.0 A line plot to join the points
geom_line(aes(group = interaction(sector, employer))) + # Set the group aesthetic by a dual-variable set
### 5.8.0 Add the points in also to help follow the connections
geom_point()
## `summarise()` has grouped output by 'sector', 'employer'. You can
## override using the `.groups` argument.
What our plot does here is connect each employer across time, showing the mean salary of those on the sunshine list. This gives us a sense of the trend for a single employer, while the lines are coloured by sector so that we can also compare trends between "Colleges" and "Universities". For instance, we can see something strange in our data from 2017, where across ALL colleges there appeared to be a large shift in salaries before returning to the regular trajectory.
Similarly, we see the various Universities in our data all follow a similar upward trend in their mean salary growth over time. Pretty cool right?
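To see what interaction() is doing for our group aesthetic, you can run it on its own. A small sketch with a toy frame standing in for our sector/employer columns (drop = TRUE keeps only the combinations that actually occur):

```r
# Toy frame with two grouping variables, standing in for sector and employer
df <- data.frame(
  sector   = c("Universities", "Universities", "Colleges"),
  employer = c("U of A", "U of B", "College C")
)

# interaction() pastes the variables into one factor, with one level per
# sector/employer combination -- a ready-made group aesthetic
g <- interaction(df$sector, df$employer, drop = TRUE)
nlevels(g)  # 3: one group per observed pairing
```

Each distinct level becomes one line in the parallel coordinate plot, which is why every employer gets its own trajectory even though colour is mapped only to sector.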
From our above data we see an interesting datapoint from 2021, where one of the university mean salaries really jumps for just that year. A quick sort through the summarised data shows this comes from the University of Sudbury. Let's go back to our boxplots and look at the salaries for all individuals across this period for just the single university. We'll combine it with a parallel coordinate plot too, so we can get a better sense of what is happening.
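That "quick sort" can be done with base R's order() (or dplyr's arrange()). A sketch on a made-up summary frame with invented values, standing in for the real summarised means:

```r
# Toy stand-in for the summarised mean-salary table (values invented)
means.df <- data.frame(
  employer   = c("U of A", "U of B", "University Of Sudbury"),
  meanSalary = c(131000, 128000, 176000)
)

# Sort descending by mean salary to surface the jump
sorted.df <- means.df[order(-means.df$meanSalary), ]
sorted.df$employer[1]  # "University Of Sudbury" tops the list
```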
# 5.9.0 Generate a boxplot with a parallel coordinate plot
sunshineFinal.df %>%
# Filter for University Of Sudbury data from 2010 to 2023
filter(sector %in% c("Universities"),
year %in% c(2010:2023),
employer %in% c("University Of Sudbury")) %>%
mutate(year = factor(year)) %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x=year, y = salary) +
# Set some configurations for the plot
theme(text = element_text(size = 20)) + # set text size
theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
# 3. Scaling
ylim(0, 300000) +
# 4. Geoms
geom_boxplot(outlier.shape = NA) +
### 5.9.0 Add datapoints like a strip plot
geom_point(aes(colour = numericID)) +
### 5.9.0 Use a line plot and link our points by numericID
geom_line(aes(group = numericID,
colour = numericID))
Although we don't have any specific names from our list, it is clear that between 2020 and 2021 a LARGE number of individuals at the University of Sudbury had a major increase in salary but stopped receiving one altogether in 2022. This could be indicative of some kind of payout/severance or an early retirement package at the end of 2021.
One individual, however, also appears to have doubled their salary between 2021 and 2022, suggesting a promotion. A quick look shows it is the President of the University of Sudbury, who perhaps took over partway through 2021 and received a full salary in subsequent years.
We’ve covered a number of key plots today, including when and how to use them. Next week we’ll revisit some of these plots and spruce them up with extra touches that will take them that extra distance. Below you’ll find a summary of what we’ve discussed today.
| Plot | Key Features | Notes |
|---|---|---|
| Scatterplot | Good for exploring relationships between variables | Bubbleplots add an extra dimension to your data |
| Barplot | Present values across groups; stack categories for an extra dimension | Presenting proportions, small sample sizes; does not dissect individual distributions |
| Nightingale plot | Circular-wedge barplot, same properties as barplot | Presenting data over unordered groups; visual emphasis on outer area size may mislead the reader |
| Racetrack plot | Circular-ringed barplot, same properties as barplot | Looking for a more compact way to show barplots; calculate length by radians as outer rings are "stretched" |
| Density plot | Theoretical distribution of your sample data | Minimum sample size 4, but 30 is more reliable; can plot up to 5 distributions on the same axis; tails can produce "ghost" data |
| Ridgeplot | Allows tighter visualization of multiple densities; same properties as KDEs | Good way to pack more KDEs into a smaller area; no real control of outliers; similar "ghost" data issues |
| Box and whisker plot | Summarize distributions with 5 parameters | Popular and compact presentation of simple populations; minimum sample size = 5; does not properly visualize multi-modal data |
| Violin plot | Boxplot format with KDE violin shape | Compact representation of distribution shape; less popular, with nuanced interpretation; inherits "ghost" data and other properties of KDE; DOES interpret multi-modal populations |
| Parallel coordinate plot | Visual representation of multivariate data; related data can be connected linearly | Connects trends across groups; not limited by number of samples; can help identify trends within multicategorical data |
This week’s assignment will be found under the current lecture folder under the “assignment” subfolder. It will include an R markdown notebook that you will use to produce the code and answers for this week’s assignment. Please provide answers in markdown or code cells that immediately follow each question section.
| Assignment breakdown | | |
|---|---|---|
| Code | 50% | Does it follow best practices? Does it make good use of available packages? Was data prepared properly? |
| Answers and Output | 50% | Is output based on the correct dataset? Are groupings appropriate? Are titles/axes/legends correct? Is interpretation of the graphs correct? |
Since coding styles and solutions can differ, students are encouraged to follow best practices. Marks may be awarded for well-coded or elegant solutions.
You can save and download the markdown notebook in its native format. Submit this file to the appropriate assignment section by 12:59 pm on the date of our next class: March 27th, 2025.
Revision 1.0.0: created and prepared for CSB1021H S LEC0141, 03-2021 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.0.1: edited and prepared for CSB1020H S LEC0141, 03-2022 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.0.2: edited and prepared for CSB1020H S LEC0141, 03-2023 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 2.0.0: Revised and prepared for CSB1020H S LEC0141, 03-2024 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 3.0.0: Revised and prepared for CSB1020H S LEC0141, 03-2025 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Public Ontario Salary Data: https://www.ontario.ca/page/public-sector-salary-disclosure
Anscombe’s Quartet: https://en.wikipedia.org/wiki/Anscombe%27s_quartet
Choosing the right chart type: https://extremepresentation.typepad.com/blog/2006/09/choosing_a_good.html
A layered grammar of graphics by Hadley Wickham: http://vita.had.co.nz/papers/layered-grammar.html
read_delim() column types and information: https://readr.tidyverse.org/reference/read_delim.html
Pie chart rules: https://www.1ka.si/d/en/help/faq/what-is-the-maximum-number-of-categories-to-display-using-pie-charts
Kernel density estimation: https://stats.stackexchange.com/questions/76948/what-is-the-minimum-number-of-data-points-required-for-kernel-density-estimation
Ridgeline plots: https://www.data-to-viz.com/graph/ridgeline.html
Visualizing samples with boxplots: https://www.nature.com/articles/nmeth.2813
Reasons to use a violin plot: https://blog.bioturing.com/2018/05/16/5-reasons-you-should-use-a-violin-graph/
Parallel coordinate plot parameters: https://www.rdocumentation.org/packages/GGally/versions/1.5.0/topics/ggparcoord
Lots of more plots in R!: https://www.r-graph-gallery.com/index.html
The Centre for the Analysis of Genome Evolution and Function (CAGEF) at the University of Toronto offers comprehensive experimental design, research, and analysis services in microbiome and metagenomic studies, genomics, proteomics, and bioinformatics.
From targeted DNA amplicon sequencing to transcriptomes, whole genomes, and metagenomes, from protein identification to post-translational modification, CAGEF has the tools and knowledge to support your research. Our state-of-the-art facility and experienced research staff provide a broad range of services, including both standard analyses and techniques developed by our team. In particular, we have special expertise in microbial, plant, and environmental systems.
For more information about us and the services we offer, please visit https://www.cagef.utoronto.ca/.